In lesson 24 we introduced the t-test for checking whether the means of two groups differ. The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level or group of the variable. We could carry out a separate t-test for each pair of groups, but conducting many tests increases the chance of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.
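To see why running many separate t-tests is risky, consider the family-wise error rate: with 5 groups there are 10 pairwise comparisons, and even if no group truly differs, the chance of at least one false positive at the 0.05 level is roughly 40%. A quick sketch (treating the tests as independent for simplicity):

```r
alpha <- 0.05
k <- choose(5, 2)  # 10 pairwise comparisons among 5 groups
1 - (1 - alpha)^k  # Chance of at least one false positive: about 0.40
```

An ANOVA sidesteps this inflation by testing all the groups with a single test.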
One-Way ANOVA
The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. It essentially answers the question: do any of the group means differ from one another? We won't get into the details of carrying out an ANOVA by hand as it involves more calculations than the t-test, but the process is similar: you go through several calculations to arrive at a test statistic and then you compare the test statistic to a critical value based on a probability distribution. In the case of the ANOVA, you use the "f-distribution", which you can access with the functions rf(), pf(), qf() and df().
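For instance, qf() gives the critical value a test statistic must exceed to be significant. A quick sketch, assuming 5 groups and 1000 observations (so 4 and 995 degrees of freedom):

```r
qf(p = 0.95,  # 95th percentile of the f-distribution
   df1 = 4,   # Number of groups minus 1
   df2 = 995) # Observations minus number of groups; roughly 2.38
```

F-statistics above this critical value would be significant at the 95% level.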
To carry out an ANOVA in R, you can use the aov() function. aov() takes a formula as the first argument of the form: numeric_response_variable ~ categorical_variable. Let's generate some fake voter age and demographic data and use the ANOVA to compare average ages across the groups:
In [1]:
set.seed(12)
voter_race <- sample(c("white", "hispanic",          # Generate race data
                       "black", "asian", "other"),
                     prob = c(0.5, 0.25, 0.15, 0.1, 0.1),
                     size = 1000,
                     replace = TRUE)

voter_age <- rnorm(1000, 50, 20)                     # Generate age data (equal means)
voter_age <- ifelse(voter_age < 18, 18, voter_age)   # Floor ages at 18
av_model <- aov(voter_age ~ voter_race) # Conduct the ANOVA and store the result
summary(av_model) # Check a summary of the test result
Out[1]:
In the test output, the test statistic is the F-value of 0.815 and the p-value is 0.515. We could have calculated the p-value using the test statistic, the given degrees of freedom and the f-distribution:
In [2]:
pf(q = 0.815,          # f-value
   df1 = 4,            # Number of groups minus 1
   df2 = 995,          # Observations minus number of groups
   lower.tail = FALSE) # Check upper tail only*
Out[2]:
*Note: similar to the chi-squared test we are only interested in the upper tail of the distribution.
The test result indicates that there is no evidence that average ages differ based on the race variable, so we fail to reject the null hypothesis that all the group means are equal.
Now let's make new age data where the group means do differ and run a second ANOVA:
In [3]:
set.seed(12)
white_dist <- rnorm(1000, 55, 20)  # Draw ages for white voters from a different distribution
white_dist <- ifelse(white_dist < 18, 18, white_dist)
new_voter_ages <- ifelse(voter_race == "white", white_dist, voter_age)
av_model <- aov(new_voter_ages ~ voter_race) # Conduct the ANOVA and store the result
summary(av_model) # Check a summary of the test result
Out[3]:
In the code above, we changed the average age for white voters to 55 while keeping the other groups unchanged with an average age of 50. The resulting p-value of 0.034 means our test is now significant at the 95% level. Notice that the test output does not indicate which group mean(s) differ from the others. We know that it is the white voters who differ because we set the data up that way, but with real data you may not know which group(s) caused the test to return a significant result. To check which groups differ after getting a significant ANOVA result, you can perform a follow-up or "post-hoc" test.
One possible post-hoc test is to perform a separate t-test for each pair of groups. You can perform a t-test between all pairs using the pairwise.t.test() function:
In [4]:
pairwise.t.test(new_voter_ages,  # Conduct pairwise t-tests between all groups
                voter_race,
                p.adj = "none")  # Do not adjust resulting p-values
Out[4]:
The resulting table shows the p-values for each pairwise t-test. Unadjusted pairwise t-tests can overstate significance because the more comparisons you make, the more likely you are to come across an unlikely result due to chance alone. We can account for this multiple comparisons problem by specifying a p-value adjustment method:
In [5]:
pairwise.t.test(new_voter_ages,       # Conduct pairwise t-tests between all groups
                voter_race,
                p.adj = "bonferroni") # Use Bonferroni correction*
Out[5]:
*Note: Bonferroni correction adjusts the significance level α by dividing it by the number of comparisons made.
Note that after adjusting for multiple comparisons, the p-values are no longer significant at the 95% level. The Bonferroni correction is somewhat conservative in its p-value estimates.
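You can also apply the same correction directly to a vector of raw p-values with R's built-in p.adjust() function, which multiplies each p-value by the number of comparisons (capped at 1). A small sketch using hypothetical raw p-values:

```r
raw_p <- c(0.011, 0.042, 0.300)         # Hypothetical raw p-values
p.adjust(raw_p, method = "bonferroni")  # Each multiplied by 3: 0.033 0.126 0.900
```

Other, less conservative methods such as "holm" (the default) are also available.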
Another common post-hoc test is Tukey's test. You can carry out Tukey's test using the built-in R function TukeyHSD():
In [6]:
TukeyHSD(av_model) # Pass fitted ANOVA model
Out[6]:
The output of the Tukey test shows the average difference, a confidence interval and a p-value for each pair of groups. Again, we see low p-values for the white-black and white-hispanic comparisons, suggesting that the white group is the one that led to the significant ANOVA result.
Two-Way ANOVA
The two-way ANOVA extends the analysis of variance to cases where you have two categorical variables of interest. For example, a two-way ANOVA would let us check whether voter age varies across two demographic variables, such as race and gender, at the same time. You can conduct a two-way ANOVA by passing an extra categorical variable into the formula supplied to the aov() function. Let's make a new variable for voter gender, alter voter ages based on that variable and then run a two-way ANOVA investigating the effects of voter gender and race on age:
In [7]:
set.seed(10)
voter_gender <- sample(c("male", "female"), # Generate genders
                       size = 1000,
                       prob = c(0.5, 0.5),
                       replace = TRUE)
# Alter age based on gender
voter_age2 <- ifelse(voter_gender == "male", voter_age - 1.5, voter_age + 1.5)
voter_age2 <- ifelse(voter_age2 < 18, 18, voter_age2)
av_model <- aov(voter_age2 ~ voter_race + voter_gender) # Perform ANOVA
summary(av_model) # Show the result
Out[7]:
In the code above we added 1.5 years to the age of each female voter and subtracted 1.5 years from the age of each male voter. The test result detects this difference in age based on gender with a p-value of 0.029 for the voter_gender variable. On the other hand, the voter_race variable appears to have no significant effect on age.
The two-way ANOVA can also test for interactions between the categorical variables. To check for interaction, add a third term to the formula you supply to aov() equal to the product of the two categorical variables:
In [8]:
av_model <- aov(voter_age2 ~ voter_race + voter_gender + # Repeat the test
(voter_race * voter_gender)) # Add interaction term
summary(av_model) # Check result
Out[8]:
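As an aside on R's formula syntax: the `*` operator already expands to the main effects plus their interaction, so the interaction term alone is written with `:` and the whole model can be written more compactly with `*`. A quick sketch on made-up data:

```r
set.seed(1)
y  <- rnorm(40)                        # Fake numeric response
g1 <- gl(2, 20, labels = c("a", "b"))  # Two fake categorical variables
g2 <- gl(2, 1, 40, labels = c("x", "y"))

m_long  <- aov(y ~ g1 + g2 + g1:g2)    # Main effects plus interaction term
m_short <- aov(y ~ g1 * g2)            # Same model: * expands to g1 + g2 + g1:g2
```

Both calls fit identical models, so the longer form above is just more explicit.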
The test result shows no significant interaction between gender and race, which is expected given that we created both independently. Let's create a new age variable with an interaction between gender and race and then run the test again:
In [9]:
# Increase the age of asian female voters by 10
interaction_age <- ifelse((voter_gender == "female") & (voter_race == "asian"),
                          voter_age + 10,  # Alter age based on gender and race
                          voter_age)
av_model <- aov(interaction_age ~ voter_race + voter_gender + # Repeat the test
(voter_race * voter_gender))
summary(av_model)
Out[9]:
In this case, we see a low p-value for the interaction between race and gender. A low p-value for the interaction term suggests that some group defined by a combination of the two categorical variables may be having a large influence on the test results. In this case, we added 10 to the ages of all Asian women voters, while all other gender/race combinations are drawn from the same distribution. To identify the specific variable combination affecting our results, we can run Tukey's test and inspect the interactions:
In [10]:
TukeyHSD(av_model) # Pass fitted ANOVA model
Out[10]:
The output shows low p-values for several comparisons between Asian females and other groups. If this were a real study, this result might lead us toward investigating Asian females as a subgroup in more detail.
Wrap Up
The one-way and two-way ANOVA tests let us check whether a numeric response variable varies according to the levels of one or two categorical variables. R makes it easy to perform ANOVA tests without diving too deep into the details of the procedure.
Next time, we'll move on from statistical inference to the final topic of this intro: predictive modeling.