Wednesday, August 12, 2015

Introduction to R Part 17: Merging Data


Data you use for your projects won't always be confined to a single table in a CSV or excel file. Data is often split across several tables that you need to combine in some way. Data frames can be joined together if they have columns in common. Joining tables in various ways is a common operation when working with databases but you can also join data frames in R using the built in function merge().
First, let's load and inspect two related tables of fake medical data. You can load the data into your own R environment by copying each table below and then using read.csv("clipboard", sep="\t")
In [1]:
setwd("C:/Users/Greg/Desktop")

table1 <- read.csv("intro_R_17_1.csv")  

table2 <- read.csv("intro_R_17_2.csv")

table1            # Inspect the data
table2
Out[1]:
P_IDgenderheightweight
11male71175
22male74225
33female64130
44female64125
55female66165
66male69160
77female62115
88male72250
Out[1]:
P_IDsexvisitscheckupfollow_upillnesssurgeryER
11male110000
22male210001
34female411200
45female1212720
57female210100
68male210100
79male100001
810female100001
*Note: your data will have an extra column "X" for row names if you read from the clipboard.
Both data frames contain the column "P_ID" but the other columns are different. A unique identifier like an ID is usually a good key for joining two data frames together. You can combine two data frames by a common column with merge():
In [2]:
combined1 <- merge(x = table1,              # First data frame
                   y = table2,              # Second data frame
                   by = "P_ID")             # Common column to merge on

print(combined1)
  P_ID gender height weight    sex visits checkup follow_up illness surgery ER
1    1   male     71    175   male      1       1         0       0       0  0
2    2   male     74    225   male      2       1         0       0       0  1
3    4 female     64    125 female      4       1         1       2       0  0
4    5 female     66    165 female     12       1         2       7       2  0
5    7 female     62    115 female      2       1         0       1       0  0
6    8   male     72    250   male      2       1         0       1       0  0
Inspecting the new combined data frame, we can see that the number of records dropped from 8 in the original tables to 6 in the combined table. If we inspect the P_ID column closely, we see that the original data tables contain some different values for P_ID. The merge() function performs an "inner join" by default. An inner join only merges records that appear in both columns used for the join. Since patients 3 and 6 only appear in table1 and patients 9 and 10 only appear in table2, those four patients were dropped when we merged the tables together.
Inner joins ensure that we don't end up introducing NA values in our data. For instance, if we kept patients 3 and 6 in the combined data frame, those patients would end up with a lot of missing values because they aren't present in the table2. If you want to keep more of your data and don't mind introducing some missing values, you can use merge to perform other types of joins, such as left joins, right joins and outer joins, by supplying an extra argument to merge():
In [3]:
# A left join keeps all key values in the first(left) data frame

left_join <- merge(x = table1,              # First data frame
                   y = table2,              # Second data frame
                   by = "P_ID",             # Common column to merge on
                   all.x = TRUE)            # Keep all key values in x?

print(left_join)
  P_ID gender height weight    sex visits checkup follow_up illness surgery ER
1    1   male     71    175   male      1       1         0       0       0  0
2    2   male     74    225   male      2       1         0       0       0  1
3    3 female     64    130   <NA>     NA      NA        NA      NA      NA NA
4    4 female     64    125 female      4       1         1       2       0  0
5    5 female     66    165 female     12       1         2       7       2  0
6    6   male     69    160   <NA>     NA      NA        NA      NA      NA NA
7    7 female     62    115 female      2       1         0       1       0  0
8    8   male     72    250   male      2       1         0       1       0  0
In [4]:
# A right join keeps all key values in the second(right) data frame

right_join <- merge(x = table1,             # First data frame
                   y = table2,              # Second data frame
                   by = "P_ID",             # Common column to merge on
                   all.y = TRUE)            # Keep all key values in y?

print(right_join)
  P_ID gender height weight    sex visits checkup follow_up illness surgery ER
1    1   male     71    175   male      1       1         0       0       0  0
2    2   male     74    225   male      2       1         0       0       0  1
3    4 female     64    125 female      4       1         1       2       0  0
4    5 female     66    165 female     12       1         2       7       2  0
5    7 female     62    115 female      2       1         0       1       0  0
6    8   male     72    250   male      2       1         0       1       0  0
7    9   <NA>     NA     NA   male      1       0         0       0       0  1
8   10   <NA>     NA     NA female      1       0         0       0       0  1
In [5]:
# An outer join keeps all key values in both data frames

outer_join <- merge(x = table1,             # First data frame
                   y = table2,              # Second data frame
                   by = "P_ID",             # Common column to merge on
                   all = TRUE)              # Keep all key values? (default=FALSE)

print(outer_join)
   P_ID gender height weight    sex visits checkup follow_up illness surgery ER
1     1   male     71    175   male      1       1         0       0       0  0
2     2   male     74    225   male      2       1         0       0       0  1
3     3 female     64    130   <NA>     NA      NA        NA      NA      NA NA
4     4 female     64    125 female      4       1         1       2       0  0
5     5 female     66    165 female     12       1         2       7       2  0
6     6   male     69    160   <NA>     NA      NA        NA      NA      NA NA
7     7 female     62    115 female      2       1         0       1       0  0
8     8   male     72    250   male      2       1         0       1       0  0
9     9   <NA>     NA     NA   male      1       0         0       0       0  1
10   10   <NA>     NA     NA female      1       0         0       0       0  1
By this point you may have noticed that the two data frames contain a second column in common. The first table contains the column "gender" while the second contains the column "sex", both of which record the same information. We can solve this issue by first renaming one of the two columns so that their names are the same and then supplying that column's name as a second column to merge upon:
In [6]:
names(table2)[2] <- "gender"                    # Rename the second column of table2

combined2 <- merge(x = table1,                  # First data frame
                   y = table2,                  # Second data frame
                   by = c("P_ID","gender"),     # Common columns to merge on
                   all = TRUE)                  # Perform an outer join

print(combined2)
   P_ID gender height weight visits checkup follow_up illness surgery ER
1     1   male     71    175      1       1         0       0       0  0
2     2   male     74    225      2       1         0       0       0  1
3     3 female     64    130     NA      NA        NA      NA      NA NA
4     4 female     64    125      4       1         1       2       0  0
5     5 female     66    165     12       1         2       7       2  0
6     6   male     69    160     NA      NA        NA      NA      NA NA
7     7 female     62    115      2       1         0       1       0  0
8     8   male     72    250      2       1         0       1       0  0
9     9   male     NA     NA      1       0         0       0       0  1
10   10 female     NA     NA      1       0         0       0       0  1
By correcting, renaming and merging on the gender column, we've managed to eliminate some NA values in the outer join. Although an outer joins can introduce NA values, they can also be helpful for discovering patterns in the data. For example, in our combined data, notice that the two patients who did not have values listed for height and weight only made visits to the ER. It could be that the hospital did not have patients 9 and 10 on record previously and that it does not take height and weight measurements for ER visits. Using the same type of intuition, it could be that patients 3 and 6 have height and weight measurements on file from visits in the past, but perhaps they did not visit the hospital during the time period for which the visit data was collected.

Wrap Up

The base R function merge() can perform common joins to combine data frames with matching columns. For some projects, you may have to merge several tables into one to get the most out of your data.
Now that we know how to prepare and merge data, we're ready to learn more about two of the most common tools for exploring data sets: frequency tables and plots.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.