Tuesday, November 17, 2015

Python for Data Analysis Part 21: Descriptive Statistics



* Edit Jan 2021: I recently completed a YouTube video covering the topics in this post.

Descriptive statistics are measures that summarize important features of data, often with a single number. Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis. We've already seen several examples of descriptive statistics in earlier lessons, such as means and medians. In this lesson, we'll review some of these functions and explore several new ones.

Measures of Center

Measures of center are statistics that give us a sense of the "middle" of a numeric variable. In other words, centrality measures give you a sense of a typical value you'd expect to see. Common measures of center include the mean, median and mode.
The mean is simply an average: the sum of the values divided by the total number of records. As we've seen in previous lessons, we can use df.mean() to get the mean of each column in a DataFrame:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ggplot import mtcars
In [3]:
mtcars.index = mtcars["name"]
mtcars.mean()                 # Get the mean of each column
Out[3]:
mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64
We can also get the means of each row by supplying an axis argument:
In [4]:
mtcars.mean(axis=1)           # Get the mean of each row
Out[4]:
name
Mazda RX4              29.907273
Mazda RX4 Wag          29.981364
Datsun 710             23.598182
Hornet 4 Drive         38.739545
Hornet Sportabout      53.664545
Valiant                35.049091
Duster 360             59.720000
Merc 240D              24.634545
Merc 230               27.233636
Merc 280               31.860000
Merc 280C              31.787273
Merc 450SE             46.430909
Merc 450SL             46.500000
Merc 450SLC            46.350000
Cadillac Fleetwood     66.232727
Lincoln Continental    66.058545
Chrysler Imperial      65.972273
Fiat 128               19.440909
Honda Civic            17.742273
Toyota Corolla         18.814091
Toyota Corona          24.888636
Dodge Challenger       47.240909
AMC Javelin            46.007727
Camaro Z28             58.752727
Pontiac Firebird       57.379545
Fiat X1-9              18.928636
Porsche 914-2          24.779091
Lotus Europa           24.880273
Ford Pantera L         60.971818
Ferrari Dino           34.508182
Maserati Bora          63.155455
Volvo 142E             26.262727
dtype: float64
The median of a distribution is the value where 50% of the data lies below it and 50% lies above it. In essence, the median splits the data in half. The median is also known as the 50% percentile since 50% of the observations are found below it. As we've seen previously, you can get the median using the df.median() function:
In [5]:
mtcars.median()                 # Get the median of each column
Out[5]:
mpg      19.200
cyl       6.000
disp    196.300
hp      123.000
drat      3.695
wt        3.325
qsec     17.710
vs        0.000
am        0.000
gear      4.000
carb      2.000
dtype: float64
Again, we could get the median of each row by supplying the argument axis=1.
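For example, the following one-liner returns a Series of row medians indexed by car name (on newer pandas versions you may need to pass numeric_only=True to skip the text column):

mtcars.median(axis=1)         # Get the median of each row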
Although the mean and median both give us some sense of the center of a distribution, they aren't always the same. The median always gives us a value that splits the data into two halves, while the mean is a numeric average, so extreme values can have a significant impact on it. In a symmetric distribution, the mean and median will be the same. Let's investigate with a density plot:
In [6]:
norm_data = pd.DataFrame(np.random.normal(size=100000))

norm_data.plot(kind="density",
              figsize=(10,10))


plt.vlines(norm_data.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.4,
           linewidth=5.0)

plt.vlines(norm_data.median(),   # Plot red line at median
           ymin=0, 
           ymax=0.4, 
           linewidth=2.0,
           color="red")
Out[6]:
<matplotlib.collections.LineCollection at 0xbf49208>
In the plot above the mean and median are both so close to zero that the red median line lies on top of the thicker black line drawn at the mean.
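If you want to see the actual numbers behind the plot, you can print the two statistics directly; both should come out very close to zero here, though the exact values vary from run to run since the data is random:

print(norm_data.mean())       # Mean of the simulated normal data
print(norm_data.median())     # Median of the simulated normal data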
In skewed distributions, the mean tends to get pulled in the direction of the skew, while the median tends to resist the effects of skew:
In [7]:
skewed_data = pd.DataFrame(np.random.exponential(size=100000))

skewed_data.plot(kind="density",
              figsize=(10,10),
              xlim=(-1,5))


plt.vlines(skewed_data.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.8,
           linewidth=5.0)

plt.vlines(skewed_data.median(),   # Plot red line at median
           ymin=0, 
           ymax=0.8, 
           linewidth=2.0,
           color="red")
Out[7]:
<matplotlib.collections.LineCollection at 0xb33cdd8>
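Printing the two statistics makes the gap explicit. For exponential data like this, the mean should land near 1 while the median sits near 0.69, although your exact numbers will differ since the data is random:

print(skewed_data.mean())     # Pulled in the direction of the long right tail
print(skewed_data.median())   # Resists the skew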
The mean is also influenced heavily by outliers, while the median resists the influence of outliers:
In [8]:
norm_data = np.random.normal(size=50)
outliers = np.random.normal(15, size=3)
combined_data = pd.DataFrame(np.concatenate((norm_data, outliers), axis=0))

combined_data.plot(kind="density",
              figsize=(10,10),
              xlim=(-5,20))


plt.vlines(combined_data.mean(),     # Plot black line at mean
           ymin=0, 
           ymax=0.2,
           linewidth=5.0)

plt.vlines(combined_data.median(),   # Plot red line at median
           ymin=0, 
           ymax=0.2, 
           linewidth=2.0,
           color="red")
Out[8]:
<matplotlib.collections.LineCollection at 0xc4bbc88>
Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic. The median generally gives a better sense of the typical value in a distribution with significant skew or outliers.
The mode of a variable is simply the value that appears most frequently. Unlike the mean and median, the mode is defined for categorical variables, and a variable can have more than one mode. Find the mode with df.mode():
In [9]:
mtcars.mode()
Out[9]:
   name   mpg  cyl   disp   hp  drat    wt   qsec   vs   am  gear  carb
0   NaN  10.4    8  275.8  110  3.07  3.44  17.02    0    0     3     2
1   NaN  15.2  NaN    NaN  175  3.92   NaN  18.90  NaN  NaN   NaN     4
2   NaN  19.2  NaN    NaN  180   NaN   NaN    NaN  NaN  NaN   NaN   NaN
3   NaN  21.0  NaN    NaN  NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
4   NaN  21.4  NaN    NaN  NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
5   NaN  22.8  NaN    NaN  NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
6   NaN  30.4  NaN    NaN  NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
The columns with multiple modes (multiple values with the same count) return multiple values as the mode. Columns with no mode (no value that appears more than once) return NaN.
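Since the mode also works on non-numeric data, here is a small, purely illustrative example with a made-up categorical variable:

colors = pd.Series(["red", "blue", "blue", "green", "red", "blue"])

colors.mode()                 # Returns "blue", the most frequent value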

Measures of Spread

Measures of spread (dispersion) are statistics that describe how data varies. While measures of center give us an idea of the typical value, measures of spread give us a sense of how much the data tends to diverge from the typical value.
One of the simplest measures of spread is the range. Range is the distance between the maximum and minimum observations:
In [10]:
max(mtcars["mpg"]) - min(mtcars["mpg"])
Out[10]:
23.5
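If you wanted the range of every numeric column at once, one option (a sketch relying on the numeric_only argument) is to subtract the column minimums from the column maximums:

mtcars.max(numeric_only=True) - mtcars.min(numeric_only=True)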
As noted earlier, the median represents the 50th percentile of a data set. A summary of several percentiles can be used to describe a variable's spread. We can extract the minimum value (0th percentile), first quartile (25th percentile), median, third quartile (75th percentile) and maximum value (100th percentile) using the quantile() function:
In [11]:
five_num = [mtcars["mpg"].quantile(0),   
            mtcars["mpg"].quantile(0.25),
            mtcars["mpg"].quantile(0.50),
            mtcars["mpg"].quantile(0.75),
            mtcars["mpg"].quantile(1)]

five_num
Out[11]:
[10.4,
 15.425000000000001,
 19.199999999999999,
 22.800000000000001,
 33.899999999999999]
Since these values are so commonly used to describe data, they are known as the "five number summary". They are the same percentile values returned by df.describe():
In [12]:
mtcars["mpg"].describe()
Out[12]:
count    32.000000
mean     20.090625
std       6.026948
min      10.400000
25%      15.425000
50%      19.200000
75%      22.800000
max      33.900000
Name: mpg, dtype: float64
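describe() also accepts a percentiles argument if you want cut points other than the quartiles; for instance, the following (a sketch, exact output may vary with your pandas version) adds the 10th and 90th percentiles to the summary:

mtcars["mpg"].describe(percentiles=[0.1, 0.9])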
The interquartile range (IQR) is another common measure of spread. The IQR is the distance between the 3rd quartile and the 1st quartile:
In [13]:
mtcars["mpg"].quantile(0.75) - mtcars["mpg"].quantile(0.25)
Out[13]:
7.375
The boxplots we learned to create in the lesson on plotting are just visual representations of the five number summary and IQR:
In [14]:
mtcars.boxplot(column="mpg",
               return_type='axes',
               figsize=(8,8))

plt.text(x=0.74, y=22.25, s="3rd Quartile")
plt.text(x=0.8, y=18.75, s="Median")
plt.text(x=0.75, y=15.5, s="1st Quartile")
plt.text(x=0.9, y=10, s="Min")
plt.text(x=0.9, y=33.5, s="Max")
plt.text(x=0.7, y=19.5, s="IQR", rotation=90, size=25)
Out[14]:
<matplotlib.text.Text at 0xb7c9f98>
Variance and standard deviation are two other common measures of spread. The variance of a distribution is the average of the squared deviations (differences) from the mean. Use df.var() to check variance:
In [15]:
mtcars["mpg"].var()
Out[15]:
36.324102822580642
The standard deviation is the square root of the variance. Standard deviation can be more interpretable than variance, since the standard deviation is expressed in terms of the same units as the variable in question while variance is expressed in terms of units squared. Use df.std() to check the standard deviation:
In [16]:
mtcars["mpg"].std()
Out[16]:
6.0269480520891037
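To tie the two together, here is a minimal sketch that reproduces both numbers by hand. Note that pandas divides by n-1 (the sample variance) rather than n by default, which the calculation below mirrors:

deviations = mtcars["mpg"] - mtcars["mpg"].mean()

manual_var = (deviations**2).sum() / (len(mtcars) - 1)   # Sample variance (n-1 denominator)
manual_std = manual_var**0.5                             # Standard deviation is its square root

print(manual_var, manual_std)    # Should match var() and std() above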
Since variance and standard deviation are both derived from the mean, they are susceptible to the influence of data skew and outliers. The median absolute deviation (MAD) is an alternative measure of spread based on the median, which inherits the median's robustness against skew and outliers. It is the median of the absolute deviations from the median:
In [17]:
abs_median_devs = abs(mtcars["mpg"] - mtcars["mpg"].median())

abs_median_devs.median() * 1.4826
Out[17]:
5.411490000000001
*Note: The MAD is often multiplied by a scaling factor of 1.4826 so that, for normally distributed data, it is comparable to the standard deviation.
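To see where that factor comes from, here is a quick sketch comparing the scaled MAD to the standard deviation on simulated normal data; the two numbers should come out close, since 1.4826 is roughly the constant that makes the MAD estimate the standard deviation under normality:

sim = pd.Series(np.random.normal(size=100000))

scaled_mad = abs(sim - sim.median()).median() * 1.4826

print(scaled_mad)             # Should be close to 1...
print(sim.std())              # ...the standard deviation of standard normal data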

Skewness and Kurtosis

Beyond measures of center and spread, descriptive statistics include measures that give you a sense of the shape of a distribution. Skewness measures the skew or asymmetry of a distribution while kurtosis measures the "peakedness" of a distribution.  We won't go into the exact calculations behind skewness and kurtosis, but they are essentially just statistics that take the idea of variance a step further: while variance involves squaring deviations from the mean, skewness involves cubing deviations from the mean and kurtosis involves raising deviations from the mean to the 4th power.
Pandas has built in functions for checking skewness and kurtosis, df.skew() and df.kurt() respectively:
In [18]:
mtcars["mpg"].skew()  # Check skewness
Out[18]:
0.6723771376290919
In [19]:
mtcars["mpg"].kurt()  # Check kurtosis
Out[19]:
-0.022006291424083859
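For the curious, here is a rough sketch of the moment calculations behind these two measures. It uses the simple (biased) moment ratios, so the results will differ slightly from pandas' skew() and kurt(), which apply small-sample bias corrections:

devs = mtcars["mpg"] - mtcars["mpg"].mean()

m2 = (devs**2).mean()         # Second central moment (population variance)
m3 = (devs**3).mean()         # Third central moment
m4 = (devs**4).mean()         # Fourth central moment

print(m3 / m2**1.5)           # Moment-based skewness
print(m4 / m2**2 - 3)         # Moment-based excess kurtosis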
To explore these two measures further, let's create some dummy data and inspect it:
In [20]:
norm_data = np.random.normal(size=100000)
skewed_data = np.concatenate((np.random.normal(size=35000)+2, 
                             np.random.exponential(size=65000)), 
                             axis=0)
uniform_data = np.random.uniform(0,2, size=100000)
peaked_data = np.concatenate((np.random.exponential(size=50000),
                             np.random.exponential(size=50000)*(-1)),
                             axis=0)

data_df = pd.DataFrame({"norm":norm_data,
                       "skewed":skewed_data,
                       "uniform":uniform_data,
                       "peaked":peaked_data})
In [21]:
data_df.plot(kind="density",
            figsize=(10,10),
            xlim=(-5,5))
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0xc170be0>
Now let's check the skewness of each of the distributions. Since skewness measures asymmetry, we'd expect to see low skewness for all of the distributions except the skewed one, because all the others are roughly symmetric:
In [22]:
data_df.skew()
Out[22]:
norm       0.005802
peaked    -0.007226
skewed     0.982716
uniform    0.001460
dtype: float64
Now let's check kurtosis. Since kurtosis measures peakedness, we'd expect the flat (uniform) distribution to have low kurtosis, while the distributions with sharper peaks should have higher kurtosis.
In [23]:
data_df.kurt()
Out[23]:
norm      -0.014785
peaked     2.958413
skewed     1.086500
uniform   -1.196268
dtype: float64
As we can see from the output, the normally distributed data has a kurtosis near zero, the flat distribution has negative kurtosis and the two pointier distributions have positive kurtosis.

Wrap Up

Descriptive statistics help you explore features of your data, like center, spread and shape by summarizing them with numerical measurements. Descriptive statistics help inform the direction of an analysis and let you communicate your insights to others quickly and succinctly. In addition, certain values, like the mean and variance, are used in all sorts of statistical tests and predictive models.
In this lesson, we generated a lot of random data to illustrate concepts, but we haven't actually learned much about the functions we've been using to generate random data. In the next lesson, we'll learn about probability distributions, including how to draw random data from them.
