Monday, July 27, 2015

Introduction to R Part 8: Data Frames


Structured data is usually organized in tables with a fixed number of rows and columns, like an Excel spreadsheet or a relational database table. R data frames are a data structure designed to hold such tabular data. Each column represents a variable or feature of the data, and each row represents a record, case or data point. A data frame is similar to a matrix in that it is a 2-dimensional data structure, but unlike a matrix, different columns can hold data of different types. Under the hood, a data frame is just a list in which each element (column) is a vector with the same number of items.
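As a quick check of the "list under the hood" idea, here is a minimal sketch. The data frame df and its column names are made up for illustration, and stringsAsFactors = FALSE is set explicitly so the result does not depend on your R version:

```r
# A small data frame with columns of three different types
df <- data.frame(id = 1:3,
                 label = c("x", "y", "z"),
                 flag = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

is.list(df)                        # TRUE: a data frame is a list of columns
length(df)                         # 3: the list length is the number of columns
col_classes <- sapply(df, class)   # Each column keeps its own type
col_lengths <- sapply(df, length)  # Every column has the same number of items

col_classes
col_lengths
```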

Creating Data Frames

You can create a new data frame by passing vectors of the same length to the data.frame() function. The vectors you pass in become the columns of the data frame. The data you pass in can be named or unnamed:
In [1]:
a <- c(1,2,3,4,5)                    # Create some vectors
b <- c("Life","Is","Study!","Let's","Learn")
c <- c(TRUE,FALSE,TRUE,TRUE,FALSE)

my_frame <- data.frame(a,b,c)       # Create a new data frame

my_frame
Out[1]:
  a      b     c
1 1   Life  TRUE
2 2     Is FALSE
3 3 Study!  TRUE
4 4  Let's  TRUE
5 5  Learn FALSE
Since we did not supply column names, the columns took the names of the variables used to create the data frame. We could have assigned column names when creating the data frame like this:
In [2]:
my_frame <- data.frame(numeric = a, character = b, logical = c)

my_frame
Out[2]:
  numeric character logical
1       1      Life    TRUE
2       2        Is   FALSE
3       3    Study!    TRUE
4       4     Let's    TRUE
5       5     Learn   FALSE
You can check and reassign column names using the colnames() or names() functions:
In [3]:
colnames(my_frame)

names(my_frame)
Out[3]:
[1] "numeric"   "character" "logical"
Out[3]:
[1] "numeric"   "character" "logical"
In [4]:
colnames(my_frame) <- c("c1","c2","c3")

colnames(my_frame)
Out[4]:
[1] "c1" "c2" "c3"
Data frames also support named rows. You can create row names when creating a data frame by including the row.names argument and setting it equal to a character vector to be used for row names:
In [5]:
my_frame <- data.frame(numeric = a, character = b, logical = c,
                      row.names = c("r1","r2","r3","r4","r5"))

my_frame
Out[5]:
   numeric character logical
r1       1      Life    TRUE
r2       2        Is   FALSE
r3       3    Study!    TRUE
r4       4     Let's    TRUE
r5       5     Learn   FALSE
You can check and alter row names after creating a data frame using the rownames() function:
In [6]:
rownames(my_frame)
Out[6]:
[1] "r1" "r2" "r3" "r4" "r5"
In [7]:
rownames(my_frame) <- 1:5

rownames(my_frame)
Out[7]:
[1] "1" "2" "3" "4" "5"
Another way to create a data frame is to coerce an existing matrix into a data frame using the as.data.frame() function:
In [8]:
X <- matrix(seq(10,1000,10),10,10)    #Create a 10 x 10 matrix

X_frame <- as.data.frame(X)           #Turn the matrix into a data frame

X_frame
Out[8]:
    V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
1   10 110 210 310 410 510 610 710 810  910
2   20 120 220 320 420 520 620 720 820  920
3   30 130 230 330 430 530 630 730 830  930
4   40 140 240 340 440 540 640 740 840  940
5   50 150 250 350 450 550 650 750 850  950
6   60 160 260 360 460 560 660 760 860  960
7   70 170 270 370 470 570 670 770 870  970
8   80 180 280 380 480 580 680 780 880  980
9   90 190 290 390 490 590 690 790 890  990
10 100 200 300 400 500 600 700 800 900 1000
In practice, most of the data frames you work with probably won't be ones you create yourself. When you load data into R for analysis from a tabular data source like an Excel file or a comma separated values (CSV) file, it is usually structured as a data frame. We will cover reading data into R in an upcoming lesson.
For the rest of this lesson we'll work with the mtcars data set, a small set of car-related data built into R.
In [10]:
cars <- mtcars        # Load the mtcars data 

print(cars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Summarizing Data Frames

When you load new data into R, it is a good idea to explore it to get a sense of the variables and values it contains before moving on to any kind of analysis. Real world data is often messy and cluttered with things like oddly formatted values and missing (NA) values. Cleaning data to get it into a form you can analyze (often called data munging or data wrangling) can be one of the most time-intensive tasks in working with data. Data summaries help determine what, if anything, needs to be cleaned.
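Since mtcars happens to be clean, here is a small sketch of one way you might look for missing values. The data frame messy and its values are hypothetical, made up purely for illustration:

```r
# Hypothetical data with a few missing values
messy <- data.frame(height = c(1.70, NA, 1.82, 1.65),
                    weight = c(68, 75, NA, NA))

na_counts <- colSums(is.na(messy))            # Number of NA values in each column
na_counts

complete_rows <- sum(complete.cases(messy))   # Number of rows with no missing values
complete_rows
```

A column with many NA values, or very few complete rows, is a hint that some cleaning is needed before analysis.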
Data frames support many of the summary functions that apply to matrices and lists. The summary() function is perhaps the most useful as it gives summary statistics for each variable in the data frame:
In [12]:
summary(cars)
Out[12]:
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
The str() function provides a structural overview of a data frame including the number of observations and variables:
In [13]:
str(cars)
'data.frame': 32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
*Note: the environment pane in the upper right corner of RStudio also provides useful summary information for data frames.
If a data frame is large, you won't want to try to print the entire frame to the screen. You can look at a few rows at the beginning or end of a data frame using the head() and tail() functions respectively:
In [15]:
head(cars, 5)     # Look at the first 5 rows of the data frame

tail(cars, 5)     # Look at the last 5 rows of the data frame
Out[15]:
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Out[15]:
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
Data frames support a few other basic summary operations:
In [33]:
dim(cars)      # Get the dimensions of the data frame
Out[33]:
[1] 32 11
In [34]:
nrow(cars)     # Get the number of rows
Out[34]:
32
In [35]:
ncol(cars)     # Get the number of columns
Out[35]:
11
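Because a data frame behaves like both a matrix and a list, summary functions from either world tend to work on it. A short sketch, using the built-in mtcars data directly (the variable names col_means and col_max are just for illustration):

```r
col_means <- colMeans(mtcars)     # Matrix-style: mean of every column
col_max   <- sapply(mtcars, max)  # List-style: apply max() to every column

round(col_means, 2)
col_max

length(mtcars)                    # As with a list, length is the number of columns
```

Note that colMeans() expects every column to be numeric, which is true of mtcars but not of data frames in general.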

Data Frame Indexing

Since data frames are lists where each list object is a column, they support all indexing operations that apply to lists:
In [37]:
head( mtcars[6]  )      # Single brackets take column slices 

typeof( mtcars[6] )     # And return a new data frame
Out[37]:
                     wt
Mazda RX4         2.620
Mazda RX4 Wag     2.875
Datsun 710        2.320
Hornet 4 Drive    3.215
Hornet Sportabout 3.440
Valiant           3.460
Out[37]:
"list"
In [31]:
head( mtcars[[6]]  )    # Double brackets get the actual object at the index

typeof( mtcars[[6]]  )
Out[31]:
[1] 2.620 2.875 2.320 3.215 3.440 3.460
Out[31]:
"double"
In [32]:
head( mtcars[["wt"]]  )  # Column name notation in double brackets works

head( mtcars$wt  )       # As does the $ notation
Out[32]:
[1] 2.620 2.875 2.320 3.215 3.440 3.460
Out[32]:
[1] 2.620 2.875 2.320 3.215 3.440 3.460
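List-style single brackets also accept a vector of names or positions, which is a handy way to pull out several columns at once. A minimal sketch on the built-in mtcars data (two_cols and first_three are illustrative names):

```r
two_cols    <- mtcars[c("mpg", "wt")]   # Select columns by a character vector of names
first_three <- mtcars[1:3]              # Select columns by a vector of positions

head(two_cols, 3)
names(first_three)
```

Both results are still data frames, since single brackets return a slice of the list rather than the underlying vectors.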
Data frames also support matrix-like indexing: use a single pair of square brackets with a comma separating the row index from the column index. Matrix indexing allows you to get whole rows, whole columns or specific values within the data frame:
In [39]:
cars[2,6]   # Get the value at row 2 column 6
Out[39]:
2.875
In [40]:
cars[2, ]   # Get the second row
Out[40]:
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
In [41]:
cars[ ,6]   # Get the 6th column
Out[41]:
 [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
In [43]:
cars["Mazda RX4", ]   # Get a row by using its name
Out[43]:
          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
In [45]:
cars[ ,"mpg"]   # Get a column by using its name
Out[45]:
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
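Notice that matrix-style indexing of a single column returns a plain vector rather than a data frame. If you would rather keep the data frame structure, one option is the drop = FALSE argument. A small sketch on the built-in mtcars data (as_vector and as_frame are illustrative names):

```r
as_vector <- mtcars[, "mpg"]                 # Simplified to a numeric vector
as_frame  <- mtcars[, "mpg", drop = FALSE]   # Stays a one-column data frame

class(as_vector)
class(as_frame)
dim(as_frame)      # 32 rows, 1 column; row names are preserved
```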
All of the indexing methods shown in previous lessons still apply, even logical indexing:
In [50]:
cars[(cars$mpg > 25), ]   # Get rows where mpg is greater than 25
Out[50]:
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Instead of logical indexing, you can also use the subset() function to create data frame subsets based on logical statements. subset() takes the data frame as its first argument and a logical statement as its second argument, and returns the rows for which the statement is true:
In [55]:
subset(cars, (mpg > 20) & (hp > 70))   # Subset with over 20 mpg and 70 horsepower
Out[55]:
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
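subset() also has a select argument for choosing columns, so you can filter rows and columns in one call. A minimal sketch on the built-in mtcars data (the name efficient is just for illustration):

```r
# Keep rows with mpg over 25, and only the mpg, hp and wt columns
efficient <- subset(mtcars, mpg > 25, select = c(mpg, hp, wt))

efficient
```

Inside subset(), column names can be written without quotes or the $ prefix, which keeps longer conditions readable.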
The matrix functions cbind() and rbind() we covered in part 6 also work on data frames: cbind() joins data frames with the same number of rows side by side, and rbind() stacks data frames that have the same columns.
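Here is a short sketch of both functions, using slices of the built-in mtcars data as the pieces to combine (left, right, top, bottom, wide and tall are illustrative names):

```r
left  <- mtcars[1:3, c("mpg", "cyl")]   # Rows 1-3, two columns
right <- mtcars[1:3, c("hp", "wt")]     # The same rows, two other columns
wide  <- cbind(left, right)             # Combine side by side
dim(wide)                               # 3 rows, 4 columns

top    <- mtcars[1:2, ]                 # First two rows
bottom <- mtcars[31:32, ]               # Last two rows
tall   <- rbind(top, bottom)            # Stack one on top of the other
dim(tall)                               # 4 rows, 11 columns
```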
You can also delete columns in a data frame by assigning them a value of NULL:
In [73]:
cars$vs <- NULL     # Drop the column "vs"

cars$carb <- NULL   # Drop the column "carb"
In [74]:
subset(cars, (mpg > 20) & (hp > 70))
Out[74]:
                mpg cyl  disp  hp drat    wt  qsec am gear
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  1    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  1    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1    4
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  0    3
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  0    4
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  0    3
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  1    5
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1    5
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1    4
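Assigning NULL modifies the data frame in place, one column at a time. If you would rather drop several columns at once and leave the original untouched, one alternative is to subset by name. A sketch on the built-in mtcars data (drop_cols and trimmed are illustrative names):

```r
drop_cols <- c("vs", "carb")                            # Columns to remove
trimmed <- mtcars[ , !(names(mtcars) %in% drop_cols)]   # Keep every other column

names(trimmed)
ncol(mtcars)     # The original data frame still has all 11 columns
```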
You cannot drop rows by assigning them a value of NULL because data frames are stored as lists of columns. If you want to drop rows, you can use matrix-style subsetting with the - operator:
In [81]:
cars <- cars[-c(1, 3), ]    # Drop rows 1 and 3

head( cars )                # Note Mazda RX4 and Datsun 710 have been removed
Out[81]:
                   mpg cyl  disp  hp drat    wt  qsec am gear
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  1    4
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  0    3
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0    3
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  0    3
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0    3
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  0    4
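Negative indexes only work with row positions. To drop rows by name or by a condition, a logical index is one option. A minimal sketch on the built-in mtcars data (unwanted, kept and light are illustrative names):

```r
unwanted <- c("Mazda RX4", "Datsun 710")
kept <- mtcars[!(rownames(mtcars) %in% unwanted), ]   # Drop rows by name
nrow(kept)                                            # 30 rows remain

light <- mtcars[mtcars$wt <= 3, ]                     # Drop rows where wt is over 3
head(light)
```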
Data frames are one of the main reasons R is a good tool for working with data. Data in many common formats translate directly into R data frames and they are easy to summarize and subset.
Before we learn how to read data into R, there's one more data structure we need to discuss. Earlier in this lesson we created a data frame called my_frame with a column named "character":
In [58]:
my_frame
Out[58]:
  numeric character logical
1       1      Life    TRUE
2       2        Is   FALSE
3       3    Study!    TRUE
4       4     Let's    TRUE
5       5     Learn   FALSE
If we check the type of column "character", we have a surprise in store:
In [65]:
typeof( my_frame$character )
Out[65]:
"integer"
How can a column that appears to hold characters be of type integer? In versions of R before 4.0.0, data.frame() converts all character vectors into a special data structure called a factor by default (since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay characters). You can suppress the conversion by including the argument "stringsAsFactors = FALSE" when creating a data frame:
In [66]:
my_frame <- data.frame(numeric = a, character = b, logical = c, 
                       stringsAsFactors = FALSE)

typeof( my_frame$character )
Out[66]:
"character"
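If you already have a data frame whose character column was turned into a factor, you can convert it back afterwards. The sketch below sets stringsAsFactors = TRUE explicitly so it behaves the same on any R version; the names words and as_factor are made up for illustration:

```r
words <- c("Life", "Is", "Study!")

# Force the conversion to a factor, regardless of the R version's default
as_factor <- data.frame(word = words, stringsAsFactors = TRUE)

class_before <- class(as_factor$word)    # "factor"
type_before  <- typeof(as_factor$word)   # "integer": factors store integer codes

# Convert the factor column back to plain characters
as_factor$word <- as.character(as_factor$word)
type_after <- typeof(as_factor$word)     # "character"
```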
Is the coercion of characters to factors reasonable default behavior? You'll be prepared to make your own judgement on that after the next lesson.
