* Edit Jan 2021: I recently completed a YouTube video covering topics in this post:
Discovering relationships between variables is the fundamental goal of data analysis. Frequency tables are a basic tool you can use to explore data and get an idea of the relationships between variables. A frequency table is just a data table that shows the counts of one or more categorical variables.
To explore frequency tables, we'll revisit the Titanic training set from Kaggle that we studied in lesson 14. We will perform a couple of the same preprocessing steps we did in lesson 14:
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
os.chdir('C:\\Users\\Greg\\Desktop\\Kaggle\\titanic') # Set working directory
titanic_train = pd.read_csv("titanic_train.csv") # Read the data
char_cabin = titanic_train["Cabin"].astype(str) # Convert cabin to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter
titanic_train["Cabin"] = pd.Categorical(new_Cabin) # Save the new cabin var
One-Way Tables
Create frequency tables (also known as crosstabs) in pandas using the pd.crosstab() function. The function takes one or more array-like objects as indexes or columns and then constructs a new DataFrame of variable counts based on the supplied arrays. Let's make a one-way table of the survived variable:
In [3]:
my_tab = pd.crosstab(index=titanic_train["Survived"], # Make a crosstab
columns="count") # Name the count column
my_tab
Out[3]:
In [4]:
type(my_tab) # Confirm that the crosstab is a DataFrame
Out[4]:
Let's make a couple more crosstabs to explore other variables:
In [5]:
pd.crosstab(index=titanic_train["Pclass"], # Make a crosstab
columns="count") # Name the count column
Out[5]:
In [6]:
pd.crosstab(index=titanic_train["Sex"], # Make a crosstab
columns="count") # Name the count column
Out[6]:
In [7]:
cabin_tab = pd.crosstab(index=titanic_train["Cabin"], # Make a crosstab
columns="count") # Name the count column
cabin_tab
Out[7]:
Even these simple one-way tables give us some useful insight: we immediately get a sense of distribution of records across the categories. For instance, we see that males outnumbered females by a significant margin and that there were more third class passengers than first and second class passengers combined.
If you pass a variable with many unique values to table(), such a numeric variable, it will still produce a table of counts for each unique value, but the counts may not be particularly meaningful.
Since the crosstab function produces DataFrames, the DataFrame operations we've learned work on crosstabs:
In [8]:
print (cabin_tab.sum(), "\n") # Sum the counts
print (cabin_tab.shape, "\n") # Check number of rows and cols
cabin_tab.iloc[1:7] # Slice rows 1-6
Out[8]:
One of the most useful aspects of frequency tables is that they allow you to extract the proportion of the data that belongs to each category. With a one-way table, you can do this by dividing each table value by the total number of records in the table:
In [9]:
cabin_tab/cabin_tab.sum()
Out[9]:
Two-Way Tables
Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give you insight into the relationship between two variables. To create a two way table, pass two variables to the pd.crosstab() function instead of one:
In [10]:
# Table of survival vs. sex
survived_sex = pd.crosstab(index=titanic_train["Survived"],
columns=titanic_train["Sex"])
survived_sex.index= ["died","survived"]
survived_sex
Out[10]:
In [11]:
# Table of survival vs passenger class
survived_class = pd.crosstab(index=titanic_train["Survived"],
columns=titanic_train["Pclass"])
survived_class.columns = ["class1","class2","class3"]
survived_class.index= ["died","survived"]
survived_class
Out[11]:
You can get the marginal counts (totals for each row and column) by including the argument margins=True:
In [12]:
# Table of survival vs passenger class
survived_class = pd.crosstab(index=titanic_train["Survived"],
columns=titanic_train["Pclass"],
margins=True) # Include row and column totals
survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]
survived_class
Out[12]:
To get the total proportion of counts in each cell, divide the table by the grand total:
In [13]:
survived_class/survived_class.ix["coltotal","rowtotal"]
Out[13]:
To get the proportion of counts along each column (in this case, the survival rate within each passenger class) divide by the column totals:
In [14]:
survived_class/survived_class.ix["coltotal"]
Out[14]:
To get the proportion of counts along each row divide by the row totals. The division operator functions on a row-by-row basis when used on DataFrames by default. In this case we want to divide each column by the rowtotals column. To get division to work on a column by column basis, use df.div() with the axis set to 0 (or "index"):
In [15]:
survived_class.div(survived_class["rowtotal"],
axis=0)
Out[15]:
Alternatively, you can transpose the table with df.T to swap rows and columns and perform row by row division as normal:
In [16]:
survived_class.T/survived_class["rowtotal"]
Out[16]:
Higher Dimensional Tables
The crosstab() function lets you create tables out of more than two categories. Higher dimensional tables can be a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple variables. Let's create a 3-way table inspecting survival, sex and passenger class:
In [17]:
surv_sex_class = pd.crosstab(index=titanic_train["Survived"],
columns=[titanic_train["Pclass"],
titanic_train["Sex"]],
margins=True) # Include row and column totals
surv_sex_class
Out[17]:
Notice that by passing a second variable to the columns argument, the resulting table has columns categorized by both Pclass and Sex. The outermost index (Pclass) returns sections of the table instead of individual columns:
In [18]:
surv_sex_class[2] # Get the subtable under Pclass 2
Out[18]:
The secondary column index, Sex, can't be used as a top level index, but it can be used within a given Pclass:
In [19]:
surv_sex_class[2]["female"] # Get female column within Pclass 2
Out[19]:
Due to the convenient hierarchical structure of the table, we still use one division to get the proportion of survival across each column:
In [20]:
surv_sex_class/surv_sex_class.ix["All"] # Divide by column totals
Out[20]:
Here we see something quite interesting: over 90% of women in first class and second class survived, but only 50% of women in third class survived. Men in first class also survived at a greater rate than men in lower classes. Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature in a predictive model.
Wrap Up
Frequency tables are a simple yet effective tool for exploring relationships between variables that take on few unique values. Tables do, however, require you to inspect numerical values and proportions closely and it is not always easy to quickly convey insights drawn from tables to others. Creating plots is a way to visually investigate data, which takes advantage of our innate ability to process and detect patterns in images.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.