Monday, November 16, 2015

Python for Data Analysis Part 18: Merging Data



* Edit Jan 2021: I recently completed a YouTube video covering topics in this post:




Data you use for your projects won't always be confined to a single table in a CSV or excel file. Data is often split across several tables that you need to combine in some way. Data frames can be joined together if they have columns in common. Joining tables in various ways is a common operation when working with databases but you can also join data frames in Python using functions included with pandas.
First, let's import some libraries and load two tables of related data. You can load the data into your own environment by copying each table below and then using pd.read_clipboard(sep="\t")
In [1]:
import numpy as np
import pandas as pd
import os
In [2]:
table1 = pd.read_clipboard(sep="\t")  

table1
Out[2]:
Unnamed: 0P_IDgenderheightweight
011male71175
122male74225
233female64130
344female64125
455female66165
566male69160
677female62115
788male72250
In [3]:
table2 = pd.read_clipboard(sep="\t")  

table2
Out[3]:
Unnamed: 0P_IDsexvisitscheckupfollow_upillnesssurgeryER
011male110000
122male210001
234female411200
345female1212720
457female210100
568male210100
679male100001
7810female100001
Let's delete the unnamed column:
In [4]:
del table1["Unnamed: 0"]
del table2["Unnamed: 0"]
Both data frames contain the column "P_ID" but the other columns are different. A unique identifier like an ID is usually a good key for joining two data frames together. You can combine two data frames by a common column with merge():
In [5]:
combined1 = pd.merge(table1,       # First table
                    table2,        # Second table
                    how="inner",   # Merge method
                    on="P_ID")     # Column(s) to join on

combined1
Out[5]:
P_IDgenderheightweightsexvisitscheckupfollow_upillnesssurgeryER
01male71175male110000
12male74225male210001
24female64125female411200
35female66165female1212720
47female62115female210100
58male72250male210100
Inspecting the new combined data frame, we can see that the number of records dropped from 8 in the original tables to 6 in the combined table. If we inspect the P_ID column closely, we see that the original data tables contain some different values for P_ID. Note that inside the merge function we set the argument "how" to "inner". An inner join only merges records that appear in both columns used for the join. Since patients 3 and 6 only appear in table1 and patients 9 and 10 only appear in table2, those four patients were dropped when we merged the tables together.
Inner joins ensure that we don't end up introducing missing values in our data. For instance, if we kept patients 3 and 6 in the combined data frame, those patients would end up with a lot of missing values because they aren't present in the table2. If you want to keep more of your data and don't mind introducing some missing values, you can use merge to perform other types of joins, such as left joins, right joins and outer joins:
In [6]:
# A left join keeps all key values in the first(left) data frame

left_join = pd.merge(table1,       # First table
                    table2,        # Second table
                    how="left",   # Merge method
                    on="P_ID")     # Column(s) to join on

left_join
Out[6]:
P_IDgenderheightweightsexvisitscheckupfollow_upillnesssurgeryER
01male71175male110000
12male74225male210001
23female64130NaNNaNNaNNaNNaNNaNNaN
34female64125female411200
45female66165female1212720
56male69160NaNNaNNaNNaNNaNNaNNaN
67female62115female210100
78male72250male210100
In [7]:
# A right join keeps all key values in the second(right) data frame

right_join = pd.merge(table1,       # First table
                    table2,        # Second table
                    how="right",   # Merge method
                    on="P_ID")     # Column(s) to join on

right_join
Out[7]:
P_IDgenderheightweightsexvisitscheckupfollow_upillnesssurgeryER
01male71175male110000
12male74225male210001
24female64125female411200
35female66165female1212720
47female62115female210100
58male72250male210100
69NaNNaNNaNmale100001
710NaNNaNNaNfemale100001
In [8]:
# An outer join keeps all key values in both data frames

outer_join = pd.merge(table1,      # First table
                    table2,        # Second table
                    how="outer",   # Merge method
                    on="P_ID")     # Column(s) to join on

outer_join
Out[8]:
P_IDgenderheightweightsexvisitscheckupfollow_upillnesssurgeryER
01male71175male110000
12male74225male210001
23female64130NaNNaNNaNNaNNaNNaNNaN
34female64125female411200
45female66165female1212720
56male69160NaNNaNNaNNaNNaNNaNNaN
67female62115female210100
78male72250male210100
89NaNNaNNaNmale100001
910NaNNaNNaNfemale100001
By this point, you may have noticed that the two data frames contain a second column in common. The first table contains the column "gender" while the second contains the column "sex", both of which record the same information. We can solve this issue by first renaming one of the two columns so that their names are the same and then supplying that column's name as a second column to merge upon:
In [9]:
table2.rename(columns={"sex":"gender"}, inplace=True) # Rename "sex" column

combined2 = pd.merge(table1,               # First data frame
                  table2,                  # Second data frame
                  how="outer",             # Merge method
                  on=["P_ID","gender"])    # Column(s) to join on

combined2
Out[9]:
P_IDgenderheightweightvisitscheckupfollow_upillnesssurgeryER
01male71175110000
12male74225210001
23female64130NaNNaNNaNNaNNaNNaN
34female64125411200
45female661651212720
56male69160NaNNaNNaNNaNNaNNaN
67female62115210100
78male72250210100
89maleNaNNaN100001
910femaleNaNNaN100001
By renaming and merging on the gender column, we've managed to eliminate some NA values in the outer join. Although an outer joins can introduce NA values, they can also be helpful for discovering patterns in the data. For example, in our combined data, notice that the two patients who did not have values listed for height and weight only made visits to the ER. It could be that the hospital did not have patients 9 and 10 on record previously and that it does not take height and weight measurements for ER visits. Using the same type of intuition, it could be that patients 3 and 6 have height and weight measurements on file from visits in the past, but perhaps they did not visit the hospital during the time period for which the visit data was collected.

Wrap Up

The pandas function merge() can perform common joins to combine data frames with matching columns. For some projects, you may have to merge several tables into one to get the most out of your data.
Now that we know how to prepare and merge data, we're ready to learn more about two of the most common tools for exploring data sets: frequency tables and plots.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.