Thursday, August 13, 2015

Introduction to R Part 18: Frequency Tables



Discovering relationships between variables is the fundamental goal of data analysis. Frequency tables are a basic tool you can use to explore data and get an idea of the relationships between variables. A frequency table is just a data table that shows the counts of one or more categorical variables.
To explore frequency tables, we'll revisit the Titanic training set from Kaggle that we studied in lesson 13. We will perform a couple of the same preprocessing steps we did in lesson 13:
In [1]:
setwd("C:/Users/Greg/Desktop/Kaggle/titanic") 

titanic_train <- read.csv("titanic_train.csv")      # Load the data

# Preprocessing steps:

char_cabin <- as.character(titanic_train$Cabin)     # Convert cabin to character

new_Cabin <- ifelse(char_cabin == "",            # If the value is ""
                    "",                          # keep it
                    substr(char_cabin,1,1))      # Else transform it to a substring

titanic_train$Cabin <- factor(new_Cabin)         # Convert back to a factor

str(titanic_train)                               # Check data structure
'data.frame': 889 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 889 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 558 519 628 416 580 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 680 levels "110152","110413",..: 523 596 669 49 472 275 85 395 344 132 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 8 levels "","A","B","C",..: 1 4 1 4 1 1 6 1 1 1 ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

One-Way Tables

In lesson 13 we briefly introduced the table() function. The table() function takes one or more factors as input (or vectors of characters/numbers that can be converted to factors) and then produces a frequency table of counts for each unique value in the input. If you only pass one variable to table() the result is a one-way table with counts for each of the variable's levels:
In [2]:
table(titanic_train$Survived)            # One-way table of survival
Out[2]:
  0   1 
549 340 
In [3]:
table(titanic_train$Pclass)              # One-way table of passenger class
Out[3]:
  1   2   3 
214 184 491 
In [4]:
table(titanic_train$Sex)                 # One-way table of sex
Out[4]:
female   male 
   312    577 
In [5]:
table(titanic_train$Cabin)               # One-way table of cabin letter
Out[5]:
      A   B   C   D   E   F   G 
688  15  45  59  33  32  13   4 
Even these simple one-way tables give us some useful insight: we immediately get a sense of distribution of records across the factor levels. For instance, we see that males outnumbered females by a significant margin and that there were more third class passengers than first and second class passengers combined.
If you pass a variable with many unique values to table(), such a numeric variable, it will be converted into a factor and a table will be created, but the counts may not be particularly meaningful.
The output of table() is a table object, which is stored as an array of integers with associated names set equal to the names of the factor levels used to construct the table. The dimensions of the array are determined by the number of variables you pass to table(). When you make a one-way table, you end up with a 1-dimensional array, which is just a vector. This means that the functions and indexing operations we can use on vectors also generally work on one-way tables:
In [6]:
cabin_table <- table(titanic_train$Cabin)   # Store a table

sum(cabin_table)             # Get the sum of counts (total number of records)

length(cabin_table)          # Get the length of the table (number of levels)

cabin_table[2]               # Get the second item

cabin_table[3:8]             # Get items 3-8
Out[6]:
889
Out[6]:
8
Out[6]:
A: 15
Out[6]:
B
45
C
59
D
33
E
32
F
13
G
4
The table() function ignores NA values by default. If you want to keep NA values in the table, set the optional argument exclude equal to NULL:
In [7]:
age_table <- table(titanic_train$Age, exclude=NULL)   # Create table of ages

age_table                  # Check the age table and confirm it includes NA
Out[7]:
0.42 0.67 0.75 0.83 0.92    1    2    3    4    5    6    7    8    9   10   11 
   1    1    2    2    1    7   10    6   10    4    3    3    4    8    2    4 
  12   13   14 14.5   15   16   17   18   19   20 20.5   21   22   23 23.5   24 
   1    2    6    1    5   17   13   26   25   15    1   24   27   15    1   30 
24.5   25   26   27   28 28.5   29   30 30.5   31   32 32.5   33   34 34.5   35 
   1   23   18   18   25    2   20   25    2   17   18    2   15   15    1   18 
  36 36.5   37   38   39   40 40.5   41   42   43   44   45 45.5   46   47   48 
  22    1    6   10   14   13    2    6   13    5    9   12    2    3    9    9 
  49   50   51   52   53   54   55 55.5   56   57   58   59   60   61   62   63 
   6   10    7    6    1    8    2    1    4    2    5    2    4    3    3    2 
  64   65   66   70 70.5   71   74   80 <NA> 
   2    3    1    2    1    2    1    1  177 
*Note: Tables of numeric variables can be long and messy. It is often preferable to explore numeric variables using a histogram with hist().
You can ignore certain levels when creating a table by passing the levels you want to ignore to the exclude argument:
In [8]:
table(titanic_train$Cabin, exclude=c(""))  # Exclude cabins listed as the empty string
Out[8]:
 A  B  C  D  E  F  G 
15 45 59 33 32 13  4 
One of the most useful aspects of frequency tables is that they allow you to extract the proportion of the data that belongs to each level. With a one-way table, you can do this by dividing each table value by the total number of records in the table:
In [9]:
gender_table <- table(titanic_train$Sex) # Create a table of genders

gender_table / sum(gender_table)         # Divide by the total counts
Out[9]:
   female      male 
0.3509561 0.6490439 
Alternatively, you can pass the table to the prop.table() function:
In [10]:
prop.table(gender_table)
Out[10]:
   female      male 
0.3509561 0.6490439 

Two-Way Tables

Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give you insight into the relationship between two variables. To create a two way table, simply pass two variables to the table() function instead of one:
In [11]:
# Table of survival vs. sex
survived_sex <- table(titanic_train$Survived, titanic_train$Sex)

rownames(survived_sex) <- c("died","survived")   # Rename rows

survived_sex
Out[11]:
          
           female male
  died         81  468
  survived    231  109
In [12]:
# Table of survival vs passenger class
survived_class <- table(titanic_train$Survived, titanic_train$Pclass)

rownames(survived_class) <- c("died","survived")            # Rename rows
colnames(survived_class) <- c("class1","class2","class3")   # Rename cols

survived_class
Out[12]:
          
           class1 class2 class3
  died         80     97    372
  survived    134     87    119
The output of a two-way table is a two-dimensional array of integers where the row names are set to the levels of the first variable and the column names are set to the levels of the second variable. A two dimensional array is just a matrix, so the matrix functions and indexing operations we've learned generally work on two-way tables:
In [13]:
rowSums(survived_sex)     # Get row totals
Out[13]:
died
549
survived
340
In [14]:
colSums(survived_sex)     # Get column totals
Out[14]:
female
312
male
577
In [15]:
survived_sex[2,1]        # Get the value in row 2, column 1
Out[15]:
231
We can check the total proportion of counts in each cell using the same method we used for one-way tables:
In [16]:
survived_sex / sum(survived_sex )
Out[16]:
          
               female       male
  died     0.09111361 0.52643420
  survived 0.25984252 0.12260967
In [17]:
prop.table(survived_sex)
Out[17]:
          
               female       male
  died     0.09111361 0.52643420
  survived 0.25984252 0.12260967
With two-way tables, we are often more interested in proportions with respect to the counts in each row or column rather than the entire table. For instance, the table above shows us that 26% of the data is made up of women who survived, but this doesn't tell much about the rate at which women survived. To get proportions with respect to rows or columns, use prop.table() with the additional argument margin:
In [18]:
prop.table(survived_sex, margin=1)  # Set margin = 1 to get proportions across rows
Out[18]:
          
              female      male
  died     0.1475410 0.8524590
  survived 0.6794118 0.3205882
In [19]:
prop.table(survived_sex, margin=2)  # Set margin = 2 to get proportions across columns
Out[19]:
          
              female      male
  died     0.2596154 0.8110919
  survived 0.7403846 0.1889081
The second table above gives us more insight into how gender is related to survival: 74% of women survived while only 19% of men did. This information could prove very useful for the task of predicting survival. Indeed a simple model that predicts that all women survived and that all men perished achieves an accuracy rate of 76.55% on the hidden test data used to judge the Titanic Kaggle competition!

Higher Dimensional Tables

The table() function lets you create tables with more than two dimensions. Higher dimensional tables can be a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple variables. Lets create a 3-way table inspecting survival, sex and passenger class:
In [20]:
surv_sex_class <- table(titanic_train$Survived, titanic_train$Sex, titanic_train$Pclass)

dimnames(surv_sex_class)           # Check the names of each dimension
Out[20]:
[[1]]
  1. "0"
  2.  
  3. "1"
[[2]]
  1. "female"
  2.  
  3. "male"
[[3]]
  1. "1"
  2.  
  3. "2"
  4.  
  5. "3"
As seen above, you can check the names of each dimension as a list with the dimnames() function. You can also use dimnames() to reassign names to the dimensions:
In [21]:
dimnames(surv_sex_class)[[1]] <- c("died", "survived")           # Rename dimension 1
dimnames(surv_sex_class)[[3]] <- c("class1", "class2","class3")  # Rename dimension 3

surv_sex_class                          # Check the 3-way table
Out[21]:
, ,  = class1

          
           female male
  died          3   77
  survived     89   45

, ,  = class2

          
           female male
  died          6   91
  survived     70   17

, ,  = class3

          
           female male
  died         72  300
  survived     72   47
Survival rates appear to be lower for passengers with lower classes but it is hard to get an accurate picture of survival rates by looking at raw counts. Let's dig a little deeper by looking at proportions. Once again, we can use the prop.table() function to extract proportions of counts with respect to different dimensions of the table. In this case, it would be most interesting to consider survival rates of passengers by gender and class. Gender is dimension 2 and passenger class is dimension 3 so we should supply those as the margins to prop.table() to extract the proportions of interest:
In [22]:
prop.table(surv_sex_class, margin= c(2,3) )
Out[22]:
, ,  = class1

          
               female       male
  died     0.03260870 0.63114754
  survived 0.96739130 0.36885246

, ,  = class2

          
               female       male
  died     0.07894737 0.84259259
  survived 0.92105263 0.15740741

, ,  = class3

          
               female       male
  died     0.50000000 0.86455331
  survived 0.50000000 0.13544669
Here we see something quite interesting: over 90% of women in first class and second class survived, but only 50% of women in third class survived. Men in first class also survived at a greater rate than men in lower classes. Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature in a predictive model.

Wrap Up

Frequency tables are a simple yet effective tool for exploring relationships between variables that take on few unique values. Tables do, however, require you to inspect numerical values and proportions closely and it is not always easy to quickly convey insights drawn from tables to others. Creating plots is a way to visually investigate data, which takes advantage of our innate ability to process and detect patterns in images.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.