Monday, August 31, 2015

Introduction to R Part 26: Analysis of Variance (ANOVA)


In lesson 24 we introduced the t-test for checking whether the means of two groups differ. The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level, or group, of the variable. We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.
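To see why running many pairwise tests inflates false positives: if you run k independent tests, each at significance level α, the chance of at least one false positive is 1 - (1 - α)^k. A quick sketch of this calculation for five groups (the numbers are chosen purely for illustration):

```r
alpha <- 0.05          # Significance level for each individual test
k <- choose(5, 2)      # Number of pairwise comparisons among 5 groups (10)

# Probability of at least one false positive across k independent tests
family_wise_error <- 1 - (1 - alpha)^k
family_wise_error      # Roughly 0.40
```

With ten comparisons each at the 5% level, the chance of at least one spurious "significant" result climbs to roughly 40%.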

One-Way ANOVA

The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. It essentially answers the question: do any of the group means differ from one another? We won't get into the details of carrying out an ANOVA by hand as it involves more calculations than the t-test, but the process is similar: you go through several calculations to arrive at a test statistic and then you compare the test statistic to a critical value based on a probability distribution. In the case of the ANOVA, you use the F-distribution, which you can access with the functions rf(), pf(), qf() and df().
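For instance, qf() gives the critical value that the test statistic has to exceed for a significant result. A quick sketch using the degrees of freedom that come up in the voter example below:

```r
# Critical F value at the 5% significance level
f_critical <- qf(p = 0.95,    # Take the 95th percentile (upper-tail test)
                 df1 = 4,     # Number of groups minus 1
                 df2 = 995)   # Number of observations minus number of groups

f_critical                    # F statistics above this value are significant
```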
To carry out an ANOVA in R, you can use the aov() function. aov() takes a formula as the first argument of the form: numeric_response_variable ~ categorical_variable. Let's generate some fake voter age and demographic data and use the ANOVA to compare average ages across the groups:
In [1]:
set.seed(12)
voter_race <- sample(c("white", "hispanic",                   # Generate race data
                     "black", "asian", "other"),              
                     prob = c(0.5, 0.25, 0.15, 0.1, 0.1), 
                     size=1000,
                     replace=TRUE)
 
voter_age <- rnorm(1000,50,20)                        # Generate age data (equal means)
voter_age <- ifelse(voter_age<18,18,voter_age)

av_model <- aov(voter_age ~ voter_race)     # Conduct the ANOVA and store the result
summary(av_model)                           # Check a summary of the test result
Out[1]:
             Df Sum Sq Mean Sq F value Pr(>F)
voter_race    4   1204   300.9   0.815  0.515
Residuals   995 367270   369.1               
In the test output, the test statistic is the F value of 0.815 and the p-value is 0.515. We could have calculated the same p-value from the test statistic and the given degrees of freedom using the F-distribution:
In [2]:
pf(q=0.815,           # f-value
   df1=4,             # Number of groups minus 1
   df2=995,           # Observations minus number of groups
   lower.tail=FALSE)  # Check upper tail only*
Out[2]:
0.515623851249803
*Note: similar to the chi-squared test, we are only interested in the upper tail of the distribution.
The test result indicates that there is no evidence that average ages differ based on the race variable, so we fail to reject the null hypothesis that the group means are equal.
Now let's make new age data where the group means do differ and run a second ANOVA:
In [3]:
set.seed(12)
white_dist <- rnorm(1000,55,20)     # Draw ages from a different distribution for white voters
white_dist <- ifelse(white_dist<18,18,white_dist)

new_voter_ages <- ifelse(voter_race == "white", white_dist, voter_age)


av_model <- aov(new_voter_ages ~ voter_race)   # Conduct the ANOVA and store the result
summary(av_model)                              # Check a summary of the test result
Out[3]:
             Df Sum Sq Mean Sq F value Pr(>F)  
voter_race    4   3932   983.0    2.61 0.0342 *
Residuals   995 374665   376.5                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the code above, we changed the average age for white voters to 55 while keeping the other groups unchanged with an average age of 50. The resulting p-value of 0.034 means our test is now significant at the 95% level. Notice that the test output does not indicate which group mean(s) differ from the others. We know that it is the white voters who differ because we set it up that way, but when testing real data, you may not know which group(s) caused the test to come back positive. To check which groups differ after getting a positive ANOVA result, you can perform a follow-up test, or "post-hoc test".
One possible post-hoc test is to perform a separate t-test for each pair of groups. You can perform a t-test between all pairs using the pairwise.t.test() function:
In [4]:
pairwise.t.test(new_voter_ages,      # Conduct pairwise t-tests between all groups
                voter_race, 
                p.adj = "none")      # Do not adjust resulting p-value
Out[4]:
 Pairwise comparisons using t tests with pooled SD 

data:  new_voter_ages and voter_race 

         asian black hispanic other
black    0.255 -     -        -    
hispanic 0.402 0.648 -        -    
other    0.888 0.202 0.323    -    
white    0.408 0.008 0.013    0.532

P value adjustment method: none 
The resulting table shows the p-values for each pairwise t-test. Using unadjusted pairwise t-tests can overestimate significance because the more comparisons you make, the more likely you are to come across an unlikely result due to chance. We can account for this multiple comparison problem by specifying a p-value adjustment method with the p.adj argument:
In [5]:
pairwise.t.test(new_voter_ages,        # Conduct pairwise t-tests between all groups
                voter_race, 
                p.adj = "bonferroni")  # Use Bonferroni correction*
Out[5]:
 Pairwise comparisons using t tests with pooled SD 

data:  new_voter_ages and voter_race 

         asian black hispanic other
black    1.00  -     -        -    
hispanic 1.00  1.00  -        -    
other    1.00  1.00  1.00     -    
white    1.00  0.08  0.13     1.00 

P value adjustment method: bonferroni 
*Note: Bonferroni correction adjusts the significance level α by dividing it by the number of comparisons made.
Note that after adjusting for multiple comparisons, the p-values are no longer significant at the 95% level. The Bonferroni correction is somewhat conservative in its p-value estimates.
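The same corrections are available outside of pairwise.t.test() through base R's p.adjust() function, which takes a vector of raw p-values. A small sketch comparing Bonferroni to the somewhat less conservative Holm method (the raw p-values below are made up for illustration):

```r
raw_p <- c(0.008, 0.013, 0.020)           # Hypothetical raw p-values

p.adjust(raw_p, method = "bonferroni")    # Multiplies each p-value by the number of tests
p.adjust(raw_p, method = "holm")          # Step-down method, at least as powerful
```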
Another common post-hoc test is Tukey's test. You can carry out Tukey's test using the built-in R function TukeyHSD():
In [6]:
TukeyHSD(av_model)      # Pass fitted ANOVA model
Out[6]:
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = new_voter_ages ~ voter_race)

$voter_race
                     diff         lwr       upr     p adj
black-asian    -3.0009576 -10.2001836  4.198268 0.7856694
hispanic-asian -2.0604424  -8.7833237  4.662439 0.9188866
other-asian     0.4235040  -7.7872349  8.634243 0.9999112
white-asian     1.8851439  -4.3437325  8.114020 0.9222716
hispanic-black  0.9405152  -4.6833808  6.564411 0.9910128
other-black     3.4244616  -3.9136107 10.762534 0.7065118
white-black     4.8861015  -0.1368432  9.909046 0.0610682
other-hispanic  2.4839463  -4.3874134  9.355306 0.8608383
white-hispanic  3.9455862  -0.3669829  8.258155 0.0913566
white-other     1.4616399  -4.9272059  7.850486 0.9710417
The output of the Tukey test shows the average difference, a confidence interval and a p-value for each pair of groups. Again, we see low p-values for the white-black and white-hispanic comparisons, suggesting that the white group is the one that led to the positive ANOVA result.
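TukeyHSD objects also have a plot() method that draws each pairwise confidence interval, which can be easier to scan than the table: intervals that do not cross zero correspond to significant differences. A minimal sketch, using the built-in PlantGrowth data set as a stand-in:

```r
# Tukey's test on a small built-in data set (plant weight by treatment group)
tukey_result <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))

plot(tukey_result)    # One confidence interval per pair of groups
```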

Two-Way ANOVA

The two-way ANOVA extends the analysis of variance to cases where you have two categorical variables of interest. For example, a two-way ANOVA would let us check whether voter age varies across two demographic variables like race and gender at the same time. You can conduct a two-way ANOVA by passing an extra categorical variable into the formula supplied to the aov() function. Let's make a new variable for voter gender, alter voter ages based on that variable and then do a two-way ANOVA investigating the effects of voter gender and race on age:
In [7]:
set.seed(10)
voter_gender <- sample(c("male","female"),     # Generate genders
                       size=1000, 
                       prob=c(0.5,0.5),
                       replace = TRUE)
# Alter age based on gender
voter_age2 <- ifelse(voter_gender=="male", voter_age-1.5, voter_age+1.5)
voter_age2 <- ifelse(voter_age2<18,18,voter_age2)

av_model <- aov(voter_age2 ~ voter_race + voter_gender)    # Perform ANOVA
summary(av_model)                                          # Show the result
Out[7]:
              Df Sum Sq Mean Sq F value Pr(>F)  
voter_race     4   1203   300.8   0.821 0.5117  
voter_gender   1   1743  1743.4   4.760 0.0294 *
Residuals    994 364093   366.3                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the code above we added 1.5 years to the age of each female voter and subtracted 1.5 years from the age of each male voter. The test result detects this difference in age based on gender with a p-value of 0.029 for the voter_gender variable. On the other hand, the voter_race variable appears to have no significant effect on age.
The two-way ANOVA can also test for interactions between the categorical variables. To check for an interaction, add a third term to the formula you supply to aov() equal to the product of the two categorical variables:
In [8]:
av_model <- aov(voter_age2 ~ voter_race + voter_gender +    # Repeat the test
               (voter_race * voter_gender))                 # Add interaction term

summary(av_model)                                           # Check result
Out[8]:
                         Df Sum Sq Mean Sq F value Pr(>F)  
voter_race                4   1203   300.8   0.820 0.5122  
voter_gender              1   1743  1743.4   4.755 0.0294 *
voter_race:voter_gender   4   1132   282.9   0.772 0.5438  
Residuals               990 362961   366.6                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
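Incidentally, the formula above is slightly redundant: in R's formula syntax the * operator already expands to the main effects plus their interaction, so y ~ a * b fits the same model as y ~ a + b + a:b. A quick sketch with made-up data confirming the two spellings are equivalent:

```r
set.seed(1)
a <- factor(sample(c("x", "y"), 100, replace = TRUE))   # Two fake factors
b <- factor(sample(c("u", "v"), 100, replace = TRUE))
y <- rnorm(100)                                         # Fake response

fit_star <- aov(y ~ a * b)          # Shorthand form
fit_long <- aov(y ~ a + b + a:b)    # Explicit main effects plus interaction

all.equal(coef(fit_star), coef(fit_long))    # TRUE: identical fitted models
```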
The test result shows no significant interaction between gender and race, which is expected given that we created both independently. Let's create a new age variable with an interaction between gender and race and then run the test again:
In [9]:
# Increase the age of asian female voters by 10
interaction_age <- ifelse((voter_gender=="female")&(voter_race=="asian"), 
                    (voter_age + 10), voter_age)    # Alter age based on gender and race

av_model <- aov(interaction_age ~ voter_race + voter_gender +     # Repeat the test
               (voter_race * voter_gender))                       

summary(av_model)
Out[9]:
                         Df Sum Sq Mean Sq F value Pr(>F)  
voter_race                4   4612  1153.0   3.118 0.0146 *
voter_gender              1     89    88.7   0.240 0.6244  
voter_race:voter_gender   4   4229  1057.2   2.859 0.0226 *
Residuals               990 366107   369.8                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, we see a low p-value for the interaction between race and gender. A low p-value for the interaction term suggests that some group defined by a combination of the two categorical variables may be having a large influence on the test results. In this case, we added 10 to the ages of all Asian women voters, while all other gender/race combinations are drawn from the same distribution. To identify the specific variable combination affecting our results, we can run Tukey's test and inspect the interactions:
In [10]:
TukeyHSD(av_model)      # Pass fitted ANOVA model
Out[10]:
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = interaction_age ~ voter_race + voter_gender + (voter_race * voter_gender))

$voter_race
                     diff        lwr         upr     p adj
black-asian    -7.6521204 -14.786664 -0.51757695 0.0284489
hispanic-asian -6.7116051 -13.374084 -0.04912651 0.0473129
other-asian    -4.2276588 -12.364627  3.90930939 0.6149472
white-asian    -7.3934760 -13.566388 -1.22056391 0.0096866
hispanic-black  0.9405152  -4.632852  6.51388251 0.9906984
other-black     3.4244616  -3.847681 10.69660382 0.6994030
white-black     0.2586444  -4.719171  5.23645967 0.9999086
other-hispanic  2.4839463  -4.325677  9.29356924 0.8568273
white-hispanic -0.6818709  -4.955693  3.59195136 0.9924901
white-other    -3.1658172  -9.497261  3.16562710 0.6493747

$voter_gender
                  diff       lwr      upr     p adj
male-female -0.5955427 -2.982237 1.791151 0.6244823

$`voter_race:voter_gender`
                                      diff        lwr        upr     p adj
black:female-asian:female     -15.87770243 -27.933022 -3.8223831 0.0013260
hispanic:female-asian:female  -14.74735615 -25.966324 -3.5283884 0.0013720
other:female-asian:female     -10.94914470 -24.421040  2.5227502 0.2298501
white:female-asian:female     -13.31288410 -23.749261 -2.8765069 0.0022776
asian:male-asian:female       -12.37309637 -25.556171  0.8099784 0.0872193
black:male-asian:female       -12.76862494 -24.680184 -0.8570660 0.0244322
hispanic:male-asian:female    -11.89953279 -23.131587 -0.6674787 0.0278104
other:male-asian:female       -10.73456106 -24.456884  2.9877622 0.2802153
white:male-asian:female       -14.72572729 -25.178858 -4.2725967 0.0003755
hispanic:female-black:female    1.13034628  -8.104229 10.3649213 0.9999968
other:female-black:female       4.92855772  -6.941745 16.7988607 0.9498841
white:female-black:female       2.56481833  -5.701384 10.8310211 0.9931353
asian:male-black:female         3.50460606  -8.036867 15.0460791 0.9941469
black:male-black:female         3.10907749  -6.955582 13.1737372 0.9933549
hispanic:male-black:female      3.97816963  -5.272300 13.2286388 0.9378622
other:male-black:female         5.14314136  -7.010636 17.2969184 0.9436563
white:male-black:female         1.15197514  -7.135369  9.4393197 0.9999902
other:female-hispanic:female    3.79821145  -7.221707 14.8181299 0.9853952
white:female-hispanic:female    1.43447205  -5.555851  8.4247954 0.9997336
asian:male-hispanic:female      2.37425978  -8.290641 13.0391607 0.9994809
black:male-hispanic:female      1.97873121  -7.067367 11.0248291 0.9995502
hispanic:male-hispanic:female   2.84782335  -5.282717 10.9783640 0.9836604
other:male-hispanic:female      4.01279509  -7.311904 15.3374941 0.9822981
white:male-hispanic:female      0.02162886  -6.993682  7.0369401 1.0000000
white:female-other:female      -2.36373940 -12.585840  7.8583610 0.9992942
asian:male-other:female        -1.42395167 -14.438053 11.5901494 0.9999988
black:male-other:female        -1.81948024 -13.543754  9.9047938 0.9999747
hispanic:male-other:female     -0.95038809 -11.983629 10.0828529 0.9999999
other:male-other:female         0.21458364 -13.345487 13.7746546 1.0000000
white:male-other:female        -3.77658259 -14.015787  6.4626218 0.9767331
asian:male-white:female         0.93978773  -8.898548 10.7781232 0.9999996
black:male-white:female         0.54425916  -7.510840  8.5993582 1.0000000
hispanic:male-white:female      1.41335130  -5.597956  8.4246582 0.9997701
other:male-white:female         2.57832304  -7.971631 13.1282776 0.9989008
white:male-white:female        -1.41284319  -7.093277  4.2675903 0.9987365
black:male-asian:male          -0.39552857 -11.786758 10.9957010 1.0000000
hispanic:male-asian:male        0.47356357 -10.205103 11.1522299 1.0000000
other:male-asian:male           1.63853531 -11.634634 14.9117049 0.9999965
white:male-asian:male          -2.35263092 -12.208736  7.5034745 0.9990892
hispanic:male-black:male        0.86909214  -8.193230  9.9314147 0.9999996
other:male-black:male           2.03406388  -9.977131 14.0452590 0.9999467
white:male-black:male          -1.95710235 -10.033896  6.1196910 0.9989731
other:male-hispanic:male        1.16497173 -10.172692 12.5026351 0.9999993
white:male-hispanic:male       -2.82619449  -9.862415  4.2100257 0.9593104
white:male-other:male          -3.99116623 -14.557694  6.5753617 0.9727189
The output shows low p-values for several comparisons between Asian females and other groups. If this were a real study, this result might lead us toward investigating Asian females as a subgroup in more detail.

Wrap Up

The one-way and two-way ANOVA tests let us check whether a numeric response variable varies according to the levels of one or two categorical variables. R makes it easy to perform ANOVA tests without diving too deep into the details of the procedure.
Next time, we'll move on from statistical inference to the final topic of this intro: predictive modeling.
