Last lesson we introduced the framework of statistical hypothesis testing and the t-test for investigating differences in numeric variables. In this lesson we turn our attention to a common statistical test for categorical variables: the chi-squared test.
Chi-Squared Goodness-Of-Fit Test
In our study of t-tests, we introduced the one-sample t-test to check whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population, or whether the browser preferences of your friends match those of Internet users as a whole.
When working with categorical data, the values of the observations themselves aren't of much use for statistical testing, because categories like "male", "female" and "other" have no mathematical meaning. Tests dealing with categorical variables are therefore based on counts of observations in each category instead of the actual values of the variables themselves.
Let's generate some fake demographic data for the U.S. and Minnesota and walk through the chi-squared goodness-of-fit test to check whether they differ:
In [1]:
national_demographics <- c(rep("white", 100000),    # Fake demographic data
                           rep("hispanic", 60000),
                           rep("black", 50000),
                           rep("asian", 15000),
                           rep("other", 35000))

minnesota_demographics <- c(rep("white", 600),      # Fake sample data
                            rep("hispanic", 300),
                            rep("black", 250),
                            rep("asian", 75),
                            rep("other", 150))
table(national_demographics) # Check counts
table(minnesota_demographics)
Out[1]:
Chi-squared tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the following formula:

χ² = Σ (observed − expected)² / expected
In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. Let's calculate the chi-squared statistic for our data to illustrate:
In [2]:
observed <- table(minnesota_demographics)
national_ratios <- prop.table(table(national_demographics)) # Get population ratios
expected <- national_ratios * length(minnesota_demographics) # Get expected counts
expected # Check expected counts
chi_squared_statistic <- sum(((observed-expected)^2)/expected) # Calculate the statistic
chi_squared_statistic
Out[2]:
*Note: The chi-squared test assumes none of the expected counts are less than 5.
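As a quick sanity check, we can confirm this assumption holds for our data by testing whether any of the expected counts we calculated above fall below 5:

any(expected < 5)         # FALSE: the smallest expected count is roughly 79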
Similar to the t-test, where we compared the t-test statistic to a critical value based on the t-distribution to determine whether the result is significant, in the chi-squared test we compare the chi-squared statistic to a critical value based on the chi-squared distribution. R's abbreviation for the chi-squared distribution is "chisq", so we can use the functions rchisq(), pchisq(), qchisq() and dchisq() to work with it like any other probability distribution. Let's use this knowledge to find the critical value for a 95% confidence level and check the p-value of our result:
In [3]:
qchisq(p=0.95,            # Find the critical value for 95% confidence*
       df=4)              # Degrees of freedom = number of variable categories - 1

1 - pchisq(q=18.1948,     # Find the p-value for the chi-squared statistic
           df=4)
Out[3]:
*Note: we are only interested in the right tail of the chi-squared distribution, because the statistic is a sum of squared deviations that only grows larger as the observed counts move away from the expected counts.
Since our chi-squared statistic (about 18.19) exceeds the critical value (about 9.49), we'd reject the null hypothesis that the two distributions are the same.
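As a side note, pchisq() can return the right-tail probability directly when you set lower.tail=FALSE, which saves subtracting from 1:

pchisq(q=18.1948,         # Equivalent way to get the p-value
       df=4,
       lower.tail=FALSE)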
You can carry out a chi-squared goodness-of-fit test automatically using the built-in R function chisq.test():
In [4]:
chisq.test(x=observed,          # Table of observed counts
           p=national_ratios)   # Expected proportions
Out[4]:
The test results agree with the values we calculated earlier.
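Since chisq.test() returns a list-like "htest" object, you can also store the result and extract individual components such as the statistic, the p-value and the expected counts the test used (gof_result below is just an illustrative variable name):

gof_result <- chisq.test(x=observed, p=national_ratios)
gof_result$statistic      # The chi-squared test statistic
gof_result$p.value        # The p-value
gof_result$expected       # Expected counts under the null hypothesis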
Chi-Squared Test of Independence
Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another. For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent. On the other hand, your month of birth might be related to whether you excelled at sports in school, so month of birth and sports performance might not be independent.
The chi-squared test of independence tests whether two categorical variables are independent. The test of independence is commonly used to determine whether variables like education, political views and other preferences vary based on demographic factors like gender, race and religion. Let's generate some fake voter polling data and perform a test of independence:
In [5]:
set.seed(12)

voter_race <- sample(c("white", "hispanic",
                       "black", "asian", "other"),         # Generate race data
                     prob=c(0.5, 0.25, 0.15, 0.05, 0.15),  # sample() rescales weights to sum to 1
                     size=1000,
                     replace=TRUE)

table(voter_race)                                          # Check counts

voter_party <- sample(c("democrat","republican","independent"),  # Generate party data
                      prob=c(0.4, 0.4, 0.2),
                      size=1000,
                      replace=TRUE)

voter_table <- table(voter_race, voter_party)

voter_table
Out[5]:
Note that we did not use the race data to inform our generation of the party data, so the variables are independent.
For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. The main difference is that we have to calculate the expected count of each cell in a 2-dimensional table instead of a 1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells in the table by taking the rowSums() and colSums() of the table, performing an outer product on them with the outer() function and dividing by the number of observations:
In [6]:
expected <- outer(rowSums(voter_table),                  # Take the outer product of row and col totals
                  colSums(voter_table))/sum(voter_table) # Divide by number of obs
expected # Inspect expected values
Out[6]:
Now we can follow the same steps we took before to calculate the chi-squared statistic, the critical value and the p-value:
In [7]:
chi_squared_statistic <- sum(((voter_table-expected)^2)/expected)

chi_squared_statistic

qchisq(p=0.95,                        # Find the critical value for 95% confidence
       df=8)                          # Degrees of freedom*

1 - pchisq(q=chi_squared_statistic,   # Find the p-value for the chi-squared statistic
           df=8)
Out[7]:
*Note: the degrees of freedom for a test of independence equal the product of the number of categories in each variable minus one. In this case we have a 5x3 table, so df = (5-1) x (3-1) = 8.
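You can also compute the degrees of freedom directly from the dimensions of the contingency table:

(nrow(voter_table) - 1) * (ncol(voter_table) - 1)   # (5-1) * (3-1) = 8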
As with the goodness-of-fit test, we can use the chisq.test() function to conduct a test of independence automatically:
In [8]:
chisq.test(x=voter_race,    # First variable to test
           y=voter_party)   # Second variable to test
Out[8]:
As expected, the test does not detect a significant relationship between the variables.
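Note that chisq.test() also accepts a contingency table directly, so passing in the voter_table we built earlier produces the same result:

chisq.test(voter_table)   # Equivalent test using the contingency table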
Wrap Up
Chi-squared tests provide a way to investigate differences in the distributions of categorical variables with the same levels and the dependence between categorical variables with different levels. In the next lesson, we'll learn about a third statistical inference test, the analysis of variance (ANOVA), that lets us compare several sample means at the same time.