Discovering relationships between variables is the fundamental goal of data analysis. Frequency tables are a basic tool you can use to explore data and get an idea of the relationships between variables. A frequency table is just a data table that shows the counts of one or more categorical variables.
To explore frequency tables, we'll revisit the Titanic training set from Kaggle that we studied in lesson 13. We will perform a couple of the same preprocessing steps we did in lesson 13:
In [1]:
setwd("C:/Users/Greg/Desktop/Kaggle/titanic")
titanic_train <- read.csv("titanic_train.csv") # Load the data
# Preprocessing steps:
char_cabin <- as.character(titanic_train$Cabin) # Convert cabin to character
new_Cabin <- ifelse(char_cabin == "", # If the value is ""
"", # keep it
substr(char_cabin,1,1)) # Else transform it to a substring
titanic_train$Cabin <- factor(new_Cabin) # Convert back to a factor
str(titanic_train) # Check data structure
One-Way Tables
In lesson 13 we briefly introduced the table() function. The table() function takes one or more factors as input (or vectors of characters/numbers that can be converted to factors) and then produces a frequency table of counts for each unique value in the input. If you only pass one variable to table() the result is a one-way table with counts for each of the variable's levels:
In [2]:
table(titanic_train$Survived) # One-way table of survival
Out[2]:
In [3]:
table(titanic_train$Pclass) # One-way table of passenger class
Out[3]:
In [4]:
table(titanic_train$Sex) # One-way table of sex
Out[4]:
In [5]:
table(titanic_train$Cabin) # One-way table of cabin letter
Out[5]:
Even these simple one-way tables give us some useful insight: we immediately get a sense of distribution of records across the factor levels. For instance, we see that males outnumbered females by a significant margin and that there were more third class passengers than first and second class passengers combined.
If you pass a variable with many unique values to table(), such a numeric variable, it will be converted into a factor and a table will be created, but the counts may not be particularly meaningful.
The output of table() is a table object, which is stored as an array of integers with associated names set equal to the names of the factor levels used to construct the table. The dimensions of the array are determined by the number of variables you pass to table(). When you make a one-way table, you end up with a 1-dimensional array, which is just a vector. This means that the functions and indexing operations we can use on vectors also generally work on one-way tables:
In [6]:
cabin_table <- table(titanic_train$Cabin) # Store a table
sum(cabin_table) # Get the sum of counts (total number of records)
length(cabin_table) # Get the length of the table (number of levels)
cabin_table[2] # Get the second item
cabin_table[3:8] # Get items 3-8
Out[6]:
Out[6]:
Out[6]:
Out[6]:
The table() function ignores NA values by default. If you want to keep NA values in the table, set the optional argument exclude equal to NULL:
In [7]:
age_table <- table(titanic_train$Age, exclude=NULL) # Create table of ages
age_table # Check the age table and confirm it includes NA
Out[7]:
*Note: Tables of numeric variables can be long and messy. It is often preferable to explore numeric variables using a histogram with hist().
You can ignore certain levels when creating a table by passing the levels you want to ignore to the exclude argument:
In [8]:
table(titanic_train$Cabin, exclude=c("")) # Exclude cabins listed as the empty string
Out[8]:
One of the most useful aspects of frequency tables is that they allow you to extract the proportion of the data that belongs to each level. With a one-way table, you can do this by dividing each table value by the total number of records in the table:
In [9]:
gender_table <- table(titanic_train$Sex) # Create a table of genders
gender_table / sum(gender_table) # Divide by the total counts
Out[9]:
Alternatively, you can pass the table to the prop.table() function:
In [10]:
prop.table(gender_table)
Out[10]:
Two-Way Tables
Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give you insight into the relationship between two variables. To create a two way table, simply pass two variables to the table() function instead of one:
In [11]:
# Table of survival vs. sex
survived_sex <- table(titanic_train$Survived, titanic_train$Sex)
rownames(survived_sex) <- c("died","survived") # Rename rows
survived_sex
Out[11]:
In [12]:
# Table of survival vs passenger class
survived_class <- table(titanic_train$Survived, titanic_train$Pclass)
rownames(survived_class) <- c("died","survived") # Rename rows
colnames(survived_class) <- c("class1","class2","class3") # Rename cols
survived_class
Out[12]:
The output of a two-way table is a two-dimensional array of integers where the row names are set to the levels of the first variable and the column names are set to the levels of the second variable. A two dimensional array is just a matrix, so the matrix functions and indexing operations we've learned generally work on two-way tables:
In [13]:
rowSums(survived_sex) # Get row totals
Out[13]:
In [14]:
colSums(survived_sex) # Get column totals
Out[14]:
In [15]:
survived_sex[2,1] # Get the value in row 2, column 1
Out[15]:
We can check the total proportion of counts in each cell using the same method we used for one-way tables:
In [16]:
survived_sex / sum(survived_sex )
Out[16]:
In [17]:
prop.table(survived_sex)
Out[17]:
With two-way tables, we are often more interested in proportions with respect to the counts in each row or column rather than the entire table. For instance, the table above shows us that 26% of the data is made up of women who survived, but this doesn't tell much about the rate at which women survived. To get proportions with respect to rows or columns, use prop.table() with the additional argument margin:
In [18]:
prop.table(survived_sex, margin=1) # Set margin = 1 to get proportions across rows
Out[18]:
In [19]:
prop.table(survived_sex, margin=2) # Set margin = 2 to get proportions across columns
Out[19]:
The second table above gives us more insight into how gender is related to survival: 74% of women survived while only 19% of men did. This information could prove very useful for the task of predicting survival. Indeed a simple model that predicts that all women survived and that all men perished achieves an accuracy rate of 76.55% on the hidden test data used to judge the Titanic Kaggle competition!
Higher Dimensional Tables
The table() function lets you create tables with more than two dimensions. Higher dimensional tables can be a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple variables. Lets create a 3-way table inspecting survival, sex and passenger class:
In [20]:
surv_sex_class <- table(titanic_train$Survived, titanic_train$Sex, titanic_train$Pclass)
dimnames(surv_sex_class) # Check the names of each dimension
Out[20]:
As seen above, you can check the names of each dimension as a list with the dimnames() function. You can also use dimnames() to reassign names to the dimensions:
In [21]:
dimnames(surv_sex_class)[[1]] <- c("died", "survived") # Rename dimension 1
dimnames(surv_sex_class)[[3]] <- c("class1", "class2","class3") # Rename dimension 3
surv_sex_class # Check the 3-way table
Out[21]:
Survival rates appear to be lower for passengers with lower classes but it is hard to get an accurate picture of survival rates by looking at raw counts. Let's dig a little deeper by looking at proportions. Once again, we can use the prop.table() function to extract proportions of counts with respect to different dimensions of the table. In this case, it would be most interesting to consider survival rates of passengers by gender and class. Gender is dimension 2 and passenger class is dimension 3 so we should supply those as the margins to prop.table() to extract the proportions of interest:
In [22]:
prop.table(surv_sex_class, margin= c(2,3) )
Out[22]:
Here we see something quite interesting: over 90% of women in first class and second class survived, but only 50% of women in third class survived. Men in first class also survived at a greater rate than men in lower classes. Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature in a predictive model.
Wrap Up
Frequency tables are a simple yet effective tool for exploring relationships between variables that take on few unique values. Tables do, however, require you to inspect numerical values and proportions closely and it is not always easy to quickly convey insights drawn from tables to others. Creating plots is a way to visually investigate data, which takes advantage of our innate ability to process and detect patterns in images.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.