* Edit Jan 2021: I recently completed a YouTube video covering the topics in this post.
Descriptive statistics are measures that summarize important features of data, often with a single number. Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis. We've already seen several examples of descriptive statistics in earlier lessons, such as means and medians. In this lesson, we'll review some of these functions and explore several new ones.
Measures of Center
Measures of center are statistics that give us a sense of the "middle" of a numeric variable. In other words, centrality measures give you a sense of the typical value you'd expect to see. Common measures of center include the mean, median and mode.
The mean is simply an average: the sum of the values divided by the total number of records. As we've seen in previous lessons, we can use df.mean() to get the mean of each column in a DataFrame:
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ggplot import mtcars
In [3]:
mtcars.index = mtcars["name"]
mtcars.mean() # Get the mean of each column
Out[3]:
We can also get the means of each row by supplying an axis argument:
In [4]:
mtcars.mean(axis=1) # Get the mean of each row
Out[4]:
The median of a distribution is the value where 50% of the data lies below it and 50% lies above it. In essence, the median splits the data in half. The median is also known as the 50th percentile since 50% of the observations are found below it. As we've seen previously, you can get the median using the df.median() function:
In [5]:
mtcars.median() # Get the median of each column
Out[5]:
Again, we could get the medians across each row by supplying the argument axis=1.
Although the mean and median both give us some sense of the center of a distribution, they aren't always the same. The median always gives us a value that splits the data into two halves, while the mean is a numeric average, so extreme values can have a significant impact on it. In a symmetric distribution, the mean and median will be the same. Let's investigate with a density plot:
In [6]:
norm_data = pd.DataFrame(np.random.normal(size=100000))
norm_data.plot(kind="density",
               figsize=(10,10))

plt.vlines(norm_data.mean(),     # Plot black line at mean
           ymin=0,
           ymax=0.4,
           linewidth=5.0)

plt.vlines(norm_data.median(),   # Plot red line at median
           ymin=0,
           ymax=0.4,
           linewidth=2.0,
           color="red")
Out[6]:
In the plot above the mean and median are both so close to zero that the red median line lies on top of the thicker black line drawn at the mean.
In skewed distributions, the mean tends to get pulled in the direction of the skew, while the median tends to resist the effects of skew:
In [7]:
skewed_data = pd.DataFrame(np.random.exponential(size=100000))
skewed_data.plot(kind="density",
                 figsize=(10,10),
                 xlim=(-1,5))

plt.vlines(skewed_data.mean(),   # Plot black line at mean
           ymin=0,
           ymax=0.8,
           linewidth=5.0)

plt.vlines(skewed_data.median(), # Plot red line at median
           ymin=0,
           ymax=0.8,
           linewidth=2.0,
           color="red")
Out[7]:
The mean is also influenced heavily by outliers, while the median resists the influence of outliers:
In [8]:
norm_data = np.random.normal(size=50)
outliers = np.random.normal(15, size=3)
combined_data = pd.DataFrame(np.concatenate((norm_data, outliers), axis=0))
combined_data.plot(kind="density",
                   figsize=(10,10),
                   xlim=(-5,20))

plt.vlines(combined_data.mean(),   # Plot black line at mean
           ymin=0,
           ymax=0.2,
           linewidth=5.0)

plt.vlines(combined_data.median(), # Plot red line at median
           ymin=0,
           ymax=0.2,
           linewidth=2.0,
           color="red")
Out[8]:
Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic. The median generally gives a better sense of the typical value in a distribution with significant skew or outliers.
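As a quick numeric illustration (using a small made-up sample), a single extreme value shifts the mean substantially while leaving the median unchanged:

```python
import pandas as pd

data = pd.Series([18, 20, 21, 22, 24])            # small made-up sample
with_outlier = pd.Series([18, 20, 21, 22, 120])   # same sample with one outlier

print(data.mean(), data.median())                 # 21.0 21.0
print(with_outlier.mean(), with_outlier.median()) # 40.2 21.0
```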
The mode of a variable is simply the value that appears most frequently. Unlike the mean and median, the mode can be taken of a categorical variable, and a variable can have more than one mode. Find the mode with df.mode():
In [9]:
mtcars.mode()
Out[9]:
The columns with multiple modes (multiple values with the same count) return multiple values as the mode. Columns with no mode (no value that appears more than once) return NaN.
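For instance (using a small made-up Series), when two values tie for the highest count, both are returned as modes:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "blue", "green"])

# "red" and "blue" each appear twice, so both are modes
print(colors.mode())
```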
Measures of Spread
Measures of spread (dispersion) are statistics that describe how data varies. While measures of center give us an idea of the typical value, measures of spread give us a sense of how much the data tends to diverge from the typical value.
One of the simplest measures of spread is the range. Range is the distance between the maximum and minimum observations:
In [10]:
max(mtcars["mpg"]) - min(mtcars["mpg"])
Out[10]:
As noted earlier, the median represents the 50th percentile of a data set. A summary of several percentiles can be used to describe a variable's spread. We can extract the minimum value (0th percentile), first quartile (25th percentile), median, third quartile (75th percentile) and maximum value (100th percentile) using the quantile() function:
In [11]:
five_num = [mtcars["mpg"].quantile(0),
            mtcars["mpg"].quantile(0.25),
            mtcars["mpg"].quantile(0.50),
            mtcars["mpg"].quantile(0.75),
            mtcars["mpg"].quantile(1)]
five_num
Out[11]:
Since these values are so commonly used to describe data, they are known as the "five number summary". They are the same percentile values returned by df.describe():
In [12]:
mtcars["mpg"].describe()
Out[12]:
The interquartile range (IQR) is another common measure of spread. The IQR is the distance between the 3rd quartile and the 1st quartile:
In [13]:
mtcars["mpg"].quantile(0.75) - mtcars["mpg"].quantile(0.25)
Out[13]:
The boxplots we learned to create in the lesson on plotting are just visual representations of the five number summary and IQR:
In [14]:
mtcars.boxplot(column="mpg",
               return_type='axes',
               figsize=(8,8))
plt.text(x=0.74, y=22.25, s="3rd Quartile")
plt.text(x=0.8, y=18.75, s="Median")
plt.text(x=0.75, y=15.5, s="1st Quartile")
plt.text(x=0.9, y=10, s="Min")
plt.text(x=0.9, y=33.5, s="Max")
plt.text(x=0.7, y=19.5, s="IQR", rotation=90, size=25)
Out[14]:
Variance and standard deviation are two other common measures of spread. The variance of a distribution is the average of the squared deviations (differences) from the mean. Use df.var() to check variance:
In [15]:
mtcars["mpg"].var()
Out[15]:
The standard deviation is the square root of the variance. Standard deviation can be more interpretable than variance, since the standard deviation is expressed in terms of the same units as the variable in question while variance is expressed in terms of units squared. Use df.std() to check the standard deviation:
In [16]:
mtcars["mpg"].std()
Out[16]:
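As a sanity check, we can compute both by hand (the values below are a few made-up mpg numbers, not the mtcars column). Note that pandas computes the sample variance by default, dividing the sum of squared deviations by n - 1 rather than n:

```python
import pandas as pd

mpg = pd.Series([21.0, 22.8, 18.7, 24.4, 19.2])  # made-up mpg values

deviations = mpg - mpg.mean()
manual_var = (deviations ** 2).sum() / (len(mpg) - 1)  # sample variance (n - 1)
manual_std = manual_var ** 0.5                          # std is the square root

print(manual_var, mpg.var())  # the two variance values match
print(manual_std, mpg.std())  # the two standard deviations match
```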
Since variance and standard deviation are both derived from the mean, they are susceptible to the influence of data skew and outliers. Median absolute deviation is an alternative measure of spread based on the median, which inherits the median's robustness against the influence of skew and outliers. It is the median of the absolute value of the deviations from the median:
In [17]:
abs_median_devs = abs(mtcars["mpg"] - mtcars["mpg"].median())
abs_median_devs.median() * 1.4826
Out[17]:
*Note: The MAD is often multiplied by a scaling factor of 1.4826.
Skewness and Kurtosis
Beyond measures of center and spread, descriptive statistics include measures that give you a sense of the shape of a distribution. Skewness measures the skew or asymmetry of a distribution while kurtosis measures the "peakedness" of a distribution. We won't go into the exact calculations behind skewness and kurtosis, but they are essentially just statistics that take the idea of variance a step further: while variance involves squaring deviations from the mean, skewness involves cubing deviations from the mean and kurtosis involves raising deviations from the mean to the 4th power.
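To make that idea concrete, here is a rough sketch of the simple moment-based versions of these statistics (note that pandas' df.skew() and df.kurt() apply bias corrections, so their results differ slightly from these raw moments):

```python
import numpy as np

def raw_skew(x):
    devs = x - x.mean()
    return (devs ** 3).mean() / x.std() ** 3      # cubed deviations from the mean

def raw_kurt(x):
    devs = x - x.mean()
    return (devs ** 4).mean() / x.std() ** 4 - 3  # 4th-power deviations, "excess" form

np.random.seed(1)
normal = np.random.normal(size=100000)

print(raw_skew(normal))  # near 0 for symmetric data
print(raw_kurt(normal))  # near 0 for normally distributed data
```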
Pandas has built-in functions for checking skewness and kurtosis, df.skew() and df.kurt() respectively:
In [18]:
mtcars["mpg"].skew() # Check skewness
Out[18]:
In [19]:
mtcars["mpg"].kurt() # Check kurtosis
Out[19]:
To explore these two measures further, let's create some dummy data and inspect it:
In [20]:
norm_data = np.random.normal(size=100000)
skewed_data = np.concatenate((np.random.normal(size=35000)+2,
                              np.random.exponential(size=65000)),
                             axis=0)

uniform_data = np.random.uniform(0,2, size=100000)

peaked_data = np.concatenate((np.random.exponential(size=50000),
                              np.random.exponential(size=50000)*(-1)),
                             axis=0)

data_df = pd.DataFrame({"norm":norm_data,
                        "skewed":skewed_data,
                        "uniform":uniform_data,
                        "peaked":peaked_data})
In [21]:
data_df.plot(kind="density",
             figsize=(10,10),
             xlim=(-5,5))
Out[21]:
Now let's check the skewness of each of the distributions. Since skewness measures asymmetry, we'd expect to see low skewness for all of the distributions except the skewed one, because all the others are roughly symmetric:
In [22]:
data_df.skew()
Out[22]:
Now let's check kurtosis. Since kurtosis measures peakedness, we'd expect the flat (uniform) distribution to have low kurtosis, while the distributions with sharper peaks should have higher kurtosis:
In [23]:
data_df.kurt()
Out[23]:
As we can see from the output, the normally distributed data has a kurtosis near zero, the flat distribution has negative kurtosis and the two pointier distributions have positive kurtosis.
Wrap Up
Descriptive statistics help you explore features of your data, like center, spread and shape by summarizing them with numerical measurements. Descriptive statistics help inform the direction of an analysis and let you communicate your insights to others quickly and succinctly. In addition, certain values, like the mean and variance, are used in all sorts of statistical tests and predictive models.
In this lesson, we generated a lot of random data to illustrate concepts, but we haven't actually learned much about the functions we've been using to generate random data. In the next lesson, we'll learn about probability distributions, including how to draw random data from them.