Thursday, August 27, 2015

Introduction to R Part 25: Chi-Squared Tests


Last lesson we introduced the framework of statistical hypothesis testing and the t-test for investigating differences in numeric variables. In this lesson we turn our attention to a common statistical test for categorical variables: the chi-squared test.

Chi-Squared Goodness-Of-Fit Test

In our study of t-tests, we introduced the one-way t-test to check whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-way t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population, or whether the computer browser preferences of your friends match those of Internet users as a whole.
When working with categorical data, the values of the observations themselves aren't of much use for statistical testing because categories like "male," "female," and "other" have no mathematical meaning. Tests dealing with categorical variables are instead based on counts of how often each category occurs.
Let's generate some fake demographic data for the U.S. and Minnesota and walk through the chi-squared goodness-of-fit test to check whether they are different:
In [1]:
national_demographics <- c(rep("white",100000),      # Fake demographic data
                           rep("hispanic",60000),
                           rep("black",50000),
                           rep("asian",15000),
                           rep("other",35000))

minnesota_demographics <- c(rep("white", 600),      # Fake sample data
                           rep("hispanic", 300),
                           rep("black", 250),
                           rep("asian", 75),
                           rep("other", 150))

table(national_demographics)            # Check counts

table(minnesota_demographics)
Out[1]:
national_demographics
   asian    black hispanic    other    white 
   15000    50000    60000    35000   100000 
Out[1]:
minnesota_demographics
   asian    black hispanic    other    white 
      75      250      300      150      600 
Chi-squared tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the following formula:

sum((observed - expected)^2 / expected)
In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. Let's calculate the chi-squared statistic for our data to illustrate:
In [2]:
observed <- table(minnesota_demographics)

national_ratios <- prop.table(table(national_demographics))     # Get population ratios

expected <- national_ratios * length(minnesota_demographics)    # Get expected counts

expected                # Check expected counts

chi_squared_statistic <- sum(((observed-expected)^2)/expected) # Calculate the statistic

chi_squared_statistic
Out[2]:
national_demographics
    asian     black  hispanic     other     white 
 79.32692 264.42308 317.30769 185.09615 528.84615 
Out[2]:
18.1948051948052
*Note: The chi-squared test assumes none of the expected counts are less than 5.
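As a quick sanity check, you could confirm this assumption holds before proceeding. A minimal sketch, using the expected vector computed above:

any(expected < 5)       # Returns TRUE if any expected count is below 5

Here all of the expected counts are well above 5, so the check returns FALSE and the assumption is satisfied.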
Similar to the t-test, where we compared the t-test statistic to a critical value based on the t-distribution to determine whether the result is significant, in the chi-squared test we compare the chi-squared test statistic to a critical value based on the chi-squared distribution. In R, the nickname for the chi-squared distribution is "chisq", so we can use the functions rchisq(), pchisq(), qchisq() and dchisq() to work with it like any other probability distribution. Let's use this knowledge to find the critical value for a 95% confidence level and check the p-value of our result:
In [3]:
qchisq(p=0.95,       # Find the critical value for 95% significance*
       df=4)         # Degrees of freedom = number of variable categories - 1

1-pchisq(q=18.1948,  # Find the p-value for the chi-square statistic
         df=4)
Out[3]:
9.48772903678115
Out[3]:
0.00113046973828934
*Note: we are only interested in the right tail of the chi-squared distribution, because the chi-squared statistic only grows larger as the observed counts deviate further from the expected counts.
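As an aside, pchisq() can compute the upper-tail probability directly through its lower.tail argument, which avoids the 1 - pchisq(...) subtraction:

pchisq(q=18.1948,     # Upper-tail p-value: equivalent to 1 - pchisq(q, df)
       df=4,
       lower.tail=FALSE)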
Since our chi-squared statistic exceeds the critical value, we'd reject the null hypothesis that the two distributions are the same.
You can carry out a chi-squared goodness-of-fit test automatically using the built-in R function chisq.test():
In [4]:
chisq.test(x= observed,          # Table of observed counts
           p= national_ratios)   # Expected proportions
Out[4]:
 Chi-squared test for given probabilities

data:  observed
X-squared = 18.1948, df = 4, p-value = 0.00113
The test results agree with the values we calculated earlier.
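If you want to work with the test results programmatically rather than just printing them, note that chisq.test() returns an "htest" object whose components you can extract. A minimal sketch, reusing the observed and national_ratios objects defined above:

gof_result <- chisq.test(x=observed,          # Store the test result
                         p=national_ratios)

gof_result$statistic      # The chi-squared test statistic
gof_result$p.value        # The p-value of the test
gof_result$expected       # The expected counts under the null hypothesis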

Chi-Squared Test of Independence

Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another. For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent. On the other hand, your month of birth might be related to whether you excelled at sports in school, so month of birth and sports performance might not be independent.
The chi-squared test of independence tests whether two categorical variables are independent. The test of independence is commonly used to determine whether variables like education, political views and other preferences vary based on demographic factors like gender, race and religion. Let's generate some fake voter polling data and perform a test of independence:
In [5]:
set.seed(12)
voter_race <- sample(c("white", "hispanic", 
                     "black", "asian", "other"),                # Generate race data
                     prob = c(0.5, 0.25, 0.15, 0.05, 0.15), 
                     size=1000,
                     replace=TRUE)

table(voter_race)         # Check counts

voter_party <- sample(c("democrat","republican","independent"),  # Generate party data
                     prob = c(0.4, 0.4, 0.2), 
                     size=1000,
                     replace=TRUE)

voter_table <- table(voter_race, voter_party) 
voter_table
Out[5]:
voter_race
   asian    black hispanic    other    white 
      38      147      225      129      461 
Out[5]:
          voter_party
voter_race democrat independent republican
  asian          13          11         14
  black          62          31         54
  hispanic      104          34         87
  other          47          25         57
  white         194          75        192
Note that we did not use the race data to inform our generation of the party data, so the variables are independent.
For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. The main difference is that we have to calculate the expected count of each cell in a 2-dimensional table instead of a 1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by its column total and then divide by the total number of observations. We can quickly get the expected counts for all cells in the table by taking the rowSums() and colSums() of the table, performing an outer product on them with the outer() function and dividing by the number of observations:
In [6]:
expected <- outer(rowSums(voter_table),   # Take the outer product of row and col totals
                  colSums(voter_table))/sum(voter_table)     # Divide by number of obs

expected         # Inspect expected values
Out[6]:
          democrat independent republican
asian       15.960       6.688     15.352
black       61.740      25.872     59.388
hispanic    94.500      39.600     90.900
other       54.180      22.704     52.116
white      193.620      81.136    186.244
Now we can follow the same steps we took before to calculate the chi-squared statistic, the critical value and the p-value:
In [7]:
chi_squared_statistic <-  sum(((voter_table-expected)^2)/expected)  

chi_squared_statistic

qchisq(p=0.95,         # Find the critical value for 95% significance
       df=8)           # Degrees of freedom*

1-pchisq(q=chi_squared_statistic,   # Find the p-value for the chi-square statistic
         df=8)
Out[7]:
9.15281415482329
Out[7]:
15.5073130558655
Out[7]:
0.329569301732227
Note: the degrees of freedom for a test of independence equal the product of the number of categories in each variable minus one. In this case we have a 5x3 table, so df = (5-1)x(3-1) = 4x2 = 8.
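You could also compute the degrees of freedom directly from the table's dimensions; a small sketch using the voter_table object created above:

degrees_of_freedom <- (nrow(voter_table)-1) * (ncol(voter_table)-1)

degrees_of_freedom      # (5-1) * (3-1) = 8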
As with the goodness-of-fit test, we can use the chisq.test() function to conduct a test of independence automatically:
In [8]:
chisq.test(x=voter_race,   # First variable to test
           y=voter_party)  # Second variable to test
Out[8]:
 Pearson's Chi-squared test

data:  voter_race and voter_party
X-squared = 9.1528, df = 8, p-value = 0.3296
As expected, the test does not detect a significant relationship between the variables.
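Note that chisq.test() also accepts a contingency table directly, so passing voter_table gives the same result as passing the two raw vectors. As with the goodness-of-fit test, the returned object exposes useful components; a minimal sketch:

independence_result <- chisq.test(voter_table)   # Pass the contingency table directly

independence_result$expected     # Expected counts under the null of independence
independence_result$p.value      # Same p-value as above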

Wrap Up

Chi-squared tests provide a way to investigate differences in the distributions of categorical variables with the same levels (the goodness-of-fit test) and the dependence between two categorical variables (the test of independence). In the next lesson, we'll learn about a third statistical inference test, the analysis of variance, that lets us compare several sample means at the same time.
