* Edit Jan 2021: I recently completed a YouTube video covering topics in this post
In lesson 24 we introduced the t-test for checking whether the means of two groups differ. The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we have to compare the means of each level or group the variable. We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives. The analysis of variance or ANOVA is a statistical inference test that lets you compare multiple groups at the same time.
One-Way ANOVA
The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. It essentially answers the question: do any of the group means differ from one another? We won't get into the details of carrying out an ANOVA by hand as it involves more calculations than the t-test, but the process is similar: you go through several calculations to arrive at a test statistic and then you compare the test statistic to a critical value based on a probability distribution. In the case of the ANOVA, you use the "f-distribution".
The scipy library has a function for carrying out one-way ANOVA tests called scipy.stats.f_oneway(). Let's generate some fake voter age and demographic data and use the ANOVA to compare average ages across the groups:
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
In [3]:
np.random.seed(12)
races = ["asian","black","hispanic","other","white"]
# Generate random data
voter_race = np.random.choice(a= races,
p = [0.05, 0.15 ,0.25, 0.05, 0.5],
size=1000)
voter_age = stats.poisson.rvs(loc=18,
mu=30,
size=1000)
# Group age data by race
voter_frame = pd.DataFrame({"race":voter_race,"age":voter_age})
groups = voter_frame.groupby("race").groups
# Etract individual groups
asian = voter_age[groups["asian"]]
black = voter_age[groups["black"]]
hispanic = voter_age[groups["hispanic"]]
other = voter_age[groups["other"]]
white = voter_age[groups["white"]]
# Perform the ANOVA
stats.f_oneway(asian, black, hispanic, other, white)
Out[3]:
The test output yields an F-statistic of 1.774 and a p-value of 0.1317, indicating that there is no significant difference between the means of each group.
Now let's make new age data where the group means do differ and run a second ANOVA:
In [4]:
np.random.seed(12)
# Generate random data
voter_race = np.random.choice(a= races,
p = [0.05, 0.15 ,0.25, 0.05, 0.5],
size=1000)
# Use a different distribution for white ages
white_ages = stats.poisson.rvs(loc=18,
mu=32,
size=1000)
voter_age = stats.poisson.rvs(loc=18,
mu=30,
size=1000)
voter_age = np.where(voter_race=="white", white_ages, voter_age)
# Group age data by race
voter_frame = pd.DataFrame({"race":voter_race,"age":voter_age})
groups = voter_frame.groupby("race").groups
# Extract individual groups
asian = voter_age[groups["asian"]]
black = voter_age[groups["black"]]
hispanic = voter_age[groups["hispanic"]]
other = voter_age[groups["other"]]
white = voter_age[groups["white"]]
# Perform the ANOVA
stats.f_oneway(asian, black, hispanic, other, white)
Out[4]:
The test result suggests the groups don't have the same sample means in this case, since the p-value is significant at a 99% confidence level. We know that it is the white voters who differ because we set it up that way in the code, but when testing real data, you may not know which group(s) caused the test to throw a positive result. To check which groups differ after getting a positive ANOVA result, you can perform a follow up test or "post-hoc test".
One post-hoc test is to perform a separate t-test for each pair of groups. You can perform a t-test between all pairs using by running each pair through the stats.ttest_ind() we covered in the lesson on t-tests:
In [5]:
# Get all race pairs
race_pairs = []
for race1 in range(4):
for race2 in range(race1+1,5):
race_pairs.append((races[race1], races[race2]))
# Conduct t-test on each pair
for race1, race2 in race_pairs:
print(race1, race2)
print(stats.ttest_ind(voter_age[groups[race1]],
voter_age[groups[race2]]))
The p-values for each pairwise t-test suggest mean of white voters is likely different from the other groups, since the p-values for each t-test involving the white group is below 0.05. Using unadjusted pairwise t-tests can overestimate significance, however, because the more comparisons you make, the more likely you are to come across an unlikely result due to chance. We can adjust for this multiple comparison problem by dividing the statistical significance level by the number of comparisons made. In this case, if we were looking for a significance level of 5%, we'd be looking for p-values of 0.05/10 = 0.005 or less. This simple adjustment for multiple comparisons is known as the Bonferroni correction.
The Bonferroni correction is a conservative approach to account for the multiple comparisons problem that may end up rejecting results that are actually significant. Another common post hoc-test is Tukey's test. You can carry out Tukey's test using the pairwise_tukeyhsd() function in the statsmodels.stats.multicomp library:
In [6]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(endog=voter_age, # Data
groups=voter_race, # Groups
alpha=0.05) # Significance level
tukey.plot_simultaneous() # Plot group confidence intervals
plt.vlines(x=49.57,ymin=-0.5,ymax=4.5, color="red")
tukey.summary() # See test summary
Out[6]:
The output of the Tukey test shows the average difference, a confidence interval as well as whether you should reject the null hypothesis for each pair of groups at the given significance level. In this case, the test suggests we reject the null hypothesis for 3 pairs, with each pair including the "white" category. This suggests the white group is likely different from the others. The 95% confidence interval plot reinforces the results visually: only 1 other group's confidence interval overlaps the white group's confidence interval.
Wrap Up
The ANOVA test lets us check whether a numeric response variable varies according to the levels of a categorical variable. Python's scipy library makes it easy to perform an ANOVA without diving too deep into the details of the procedure.
Next time, we'll move on from statistical inference to the final topic of this guide: predictive modeling.
Thank you very much!
ReplyDeleteThe guide derives everything deeply for solid understanding. I have learnt Statistics theory oriented, however, it is more helpful to learn by coding .
Hi, I amtrying to run your example but I get the following error
ReplyDelete>>> voter_age = stats.poisson.rvs(loc=18,mu=30,size=1000)
Traceback (most recent call last):
File "", line 1, in
File "anaconda/lib/python2.7/site-packages/scipy/stats/distributions.py", line 5923, in rvs
return super(rv_discrete, self).rvs(*args, **kwargs)
File "anaconda/lib/python2.7/site-packages/scipy/stats/distributions.py", line 623, in rvs
vals = self._rvs(*args)
TypeError: _rvs() takes exactly 2 arguments (1 given)
It could be an incompatibly issue with Python 2.7. This guide was written using Python 3.4
Delete