Structured data is usually organized in tables with a certain number of rows and columns, like an Excel spreadsheet or a relational database table. R data frames are a type of data structure designed to hold such tabular data. A data frame consists of a number of rows and columns, with each column representing some variable or feature of the data and each row representing a record, case or data point. A data frame is similar to a matrix in that it is a 2-dimensional data structure, but unlike a matrix, different columns can hold data of different types. A data frame is actually just a list under the hood: a list where each element (column) is a vector with the same number of items.
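To make the "list under the hood" idea concrete, here is a minimal sketch using a small made-up data frame (the names df, x and y are hypothetical and only for illustration):

```r
# A small illustrative data frame (names are hypothetical)
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

is.list(df)          # TRUE: a data frame is a list
length(df)           # 2: one list element per column
sapply(df, length)   # every column holds 3 items
class(df)            # "data.frame"
```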
Creating Data Frames
You can create a new data frame by passing vectors of the same length to the data.frame() function. The vectors you pass in become the columns of the data frame. The data you pass in can be named or unnamed:
In [1]:
a <- c(1,2,3,4,5) # Create some vectors
b <- c("Life","Is","Study!","Let's","Learn")
c <- c(TRUE,FALSE,TRUE,TRUE,FALSE)
my_frame <- data.frame(a,b,c) # Create a new data frame
my_frame
Out[1]:
Since we did not supply column names, the columns took the names of the variables used to create the data frame. We could have assigned column names when creating the data frame like this:
In [2]:
my_frame <- data.frame(numeric = a, character = b, logical = c)
my_frame
Out[2]:
You can check and reassign column names using the colnames() or names() functions:
In [3]:
colnames(my_frame)
names(my_frame)
Out[3]:
Out[3]:
In [4]:
colnames(my_frame) <- c("c1","c2","c3")
colnames(my_frame)
Out[4]:
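Because names() returns an ordinary character vector, you can also rename a single column by indexing into it. The sketch below uses a throwaway data frame (df and its column names are made up for illustration):

```r
# Throwaway data frame for illustration
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

names(df)[2] <- "label"               # Rename the second column by position
names(df)[names(df) == "a"] <- "id"   # Rename a column by its current name
names(df)                             # "id" "label"
```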
Data frames also support named rows. You can create row names when creating a data frame by including the row.names argument and setting it equal to a character vector to be used for row names:
In [5]:
my_frame <- data.frame(numeric = a, character = b, logical = c,
row.names = c("r1","r2","r3","r4","r5"))
my_frame
Out[5]:
You can check and alter row names after creating a data frame using the rownames() function:
In [6]:
rownames(my_frame)
Out[6]:
In [7]:
rownames(my_frame) <- 1:5
rownames(my_frame)
Out[7]:
Another way to create a data frame is to coerce an existing matrix into a data frame using the as.data.frame() function:
In [8]:
X <- matrix(seq(10,1000,10),10,10) #Create a 10 x 10 matrix
X_frame <- as.data.frame(X) #Turn the matrix into a data frame
X_frame
Out[8]:
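When the matrix has no column names, as.data.frame() fills in default names (V1, V2, ...). If the matrix already has column names, they carry over. A short sketch (the "col" prefix is an arbitrary choice for illustration):

```r
X <- matrix(seq(10, 1000, 10), 10, 10)   # Same 10 x 10 matrix as above
X_frame <- as.data.frame(X)
colnames(X_frame)                        # Default names: "V1" ... "V10"

colnames(X) <- paste0("col", 1:10)       # Name the matrix columns first
named_frame <- as.data.frame(X)
colnames(named_frame)                    # "col1" ... "col10"
```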
In practice, most of the data frames you work with probably won't be data frames you create yourself. When you load data into R for analysis from a tabular data source like an Excel file or a comma-separated values (CSV) file, it is usually structured as a data frame. We will cover reading data into R in an upcoming lesson.
For the rest of this lesson we'll work with the mtcars data set, a small set of car-related data built into R.
In [10]:
cars <- mtcars # Load the mtcars data
print(cars)
Summarizing Data Frames
When you load new data into R, it is a good idea to explore the data to get a sense of the variables and values it contains before moving on to any kind of analysis. Real-world data is often very messy and cluttered with things like oddly formatted values and missing (NA) values. Cleaning data to get it into a form that you can analyze, often called data munging or data wrangling, can be one of the most time-intensive tasks in working with data. Data summaries help determine what, if anything, needs to be cleaned.
Data frames support many of the summary functions that apply to matrices and lists. The summary() function is perhaps the most useful as it gives summary statistics for each variable in the data frame:
In [12]:
summary(cars)
Out[12]:
The str() function provides a structural overview of a data frame including the number of observations and variables:
In [13]:
str(cars)
*Note: the environment pane in the upper right corner of RStudio also provides useful summary information for data frames.
If a data frame is large, you won't want to try to print the entire frame to the screen. You can look at a few rows at the beginning or end of a data frame using the head() and tail() functions respectively:
In [15]:
head(cars, 5) # Look at the first 5 rows of the data frame
tail(cars, 5) # Look at the last 5 rows of the data frame
Out[15]:
Out[15]:
Data frames support a few other basic summary operations:
In [33]:
dim(cars) # Get the dimensions of the data frame
Out[33]:
In [34]:
nrow(cars) # Get the number of rows
Out[34]:
In [35]:
ncol(cars) # Get the number of columns
Out[35]:
Data Frame Indexing
Since data frames are lists where each list object is a column, they support all indexing operations that apply to lists:
In [37]:
head( mtcars[6] ) # Single brackets take column slices
typeof( mtcars[6] ) # And return a new data frame (whose underlying type is "list")
Out[37]:
Out[37]:
In [31]:
head( mtcars[[6]] ) # Double brackets get the actual object at the index
typeof( mtcars[[6]] )
Out[31]:
Out[31]:
In [32]:
head( mtcars[["wt"]] ) # Column name notation in double brackets works
head( mtcars$wt ) # As does the $ notation
Out[32]:
Out[32]:
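The difference between single and double brackets is easier to see by comparing class() and typeof() on each result. This is a small sketch on the built-in mtcars data (the variable names one_col and wt_vec are made up for illustration):

```r
one_col <- mtcars[6]      # Single brackets: still a data frame
wt_vec  <- mtcars[[6]]    # Double brackets: the underlying vector

class(one_col)            # "data.frame"
typeof(one_col)           # "list"
class(wt_vec)             # "numeric"
typeof(wt_vec)            # "double"

identical(wt_vec, mtcars$wt)     # TRUE: column 6 is "wt"
head(mtcars[c("mpg", "wt")])     # Single brackets can take several columns
```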
Data frames also support matrix-like indexing by using a single square bracket with a comma separating the index values for the row and column. Matrix indexing allows you to get whole rows, whole columns or specific values within the data frame:
In [39]:
cars[2,6] # Get the value at row 2 column 6
Out[39]:
In [40]:
cars[2, ] # Get the second row
Out[40]:
In [41]:
cars[ ,6] # Get the 6th column
Out[41]:
In [43]:
cars["Mazda RX4", ] # Get a row by using its name
Out[43]:
In [45]:
cars[ ,"mpg"] # Get a column by using its name
Out[45]:
All of the indexing methods shown in previous lessons still apply, even logical indexing:
In [50]:
cars[(cars$mpg > 25), ] # Get rows where mpg is greater than 25
Out[50]:
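Logical indexing works because the comparison produces one TRUE/FALSE value per row, and only the TRUE rows are kept. The sketch below spells out that intermediate step on mtcars (keep and high_mpg are illustrative names):

```r
keep <- mtcars$mpg > 25        # One TRUE/FALSE value per row
length(keep)                   # Same as nrow(mtcars)
sum(keep)                      # Number of rows that pass the test

high_mpg <- mtcars[keep, ]     # Keep only the TRUE rows
nrow(high_mpg) == sum(keep)    # TRUE

# Conditions can be combined with & (and) and | (or)
mtcars[mtcars$mpg > 25 & mtcars$cyl == 4, c("mpg", "cyl")]
```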
Instead of logical indexing, you can also use the subset() function to create data frame subsets based on logical statements. subset() takes the data frame as the first argument and a logical statement as the second argument to create a subset:
In [55]:
subset(cars, (mpg > 20) & (hp > 70)) # Subset with over 20 mpg and 70 horsepower
Out[55]:
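subset() also accepts a select argument for choosing columns at the same time as filtering rows. A brief sketch on mtcars (the choice of columns and the name efficient are arbitrary, for illustration only):

```r
# Filter rows and keep only three columns in one call
efficient <- subset(mtcars, (mpg > 20) & (hp > 70), select = c(mpg, hp, wt))
head(efficient)
colnames(efficient)   # "mpg" "hp" "wt"
```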
The matrix functions cbind() and rbind() we covered in part 6 work on data frames, providing an easy way to combine two data frames with the same number of rows or columns.
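As a quick sketch of both functions on mtcars (the pieces top, bottom, left and right are made-up names used only to have something to combine):

```r
top    <- head(mtcars, 3)
bottom <- tail(mtcars, 3)
stacked <- rbind(top, bottom)         # Same columns: stack the rows
dim(stacked)                          # 6 rows, 11 columns

left  <- mtcars[, c("mpg", "cyl")]
right <- mtcars[, c("hp", "wt")]
side_by_side <- cbind(left, right)    # Same rows: join the columns
dim(side_by_side)                     # 32 rows, 4 columns
```

Note that rbind() on data frames matches columns by name, so both pieces need the same column names.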
You can also delete columns in a data frame by assigning them a value of NULL:
In [73]:
cars$vs <- NULL # Drop the column "vs"
cars$carb <- NULL # Drop the column "carb"
In [74]:
subset(cars, (mpg > 20) & (hp > 70))
Out[74]:
You cannot drop rows by assigning them a value of NULL due to the way data frames are stored as lists of columns. If you want to drop rows, you can use matrix-style subsetting with the - operator:
In [81]:
cars <- cars[-c(1, 3), ] # Drop rows 1 and 3
head( cars ) # Note Mazda RX4 and Datsun 710 have been removed
Out[81]:
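Negative indices only work with row positions. To drop rows by name or by condition, one option is to keep the rows that do not match. The sketch below works on a fresh copy of mtcars (cars2 is a hypothetical name chosen to avoid touching the cars variable above):

```r
cars2 <- mtcars                       # Fresh copy for illustration

# Drop rows by name: keep rows whose name is NOT in the list
drop_names <- c("Mazda RX4", "Datsun 710")
cars2 <- cars2[!(rownames(cars2) %in% drop_names), ]
nrow(cars2)                           # 30

# Drop rows by condition: keep rows that are NOT 8-cylinder cars
cars2 <- cars2[cars2$cyl != 8, ]
```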
Data frames are one of the main reasons R is a good tool for working with data. Data in many common formats translate directly into R data frames and they are easy to summarize and subset.
Before we learn how to read data into R, there's one more data structure we need to discuss. Earlier in this lesson we created a data frame called my_frame with a column named "character":
In [58]:
my_frame
Out[58]:
If we check the type of column "character", we have a surprise in store:
In [65]:
typeof( my_frame$character )
Out[65]:
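How can a column that appears to hold characters be of type integer? It turns out that in versions of R before 4.0.0, when you create a data frame, all character vectors in the data frame are converted by default into a special data structure called a factor, which is stored as integer codes. (Since R 4.0.0 the default is stringsAsFactors = FALSE, so the column is of type character unless you ask for factors.) In older versions, you can suppress the conversion by including the argument "stringsAsFactors = FALSE" when creating a data frame: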
In [66]:
my_frame <- data.frame(numeric = a, character = b, logical = c,
stringsAsFactors = FALSE)
typeof( my_frame$character )
Out[66]:
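To see both behaviors side by side regardless of which R version you are running, you can set stringsAsFactors explicitly. A minimal sketch (words, as_factor and as_text are illustrative names):

```r
words <- c("Life", "Is", "Study!", "Let's", "Learn")

as_factor <- data.frame(character = words, stringsAsFactors = TRUE)
as_text   <- data.frame(character = words, stringsAsFactors = FALSE)

class(as_factor$character)    # "factor"
typeof(as_factor$character)   # "integer": factors are stored as integer codes
class(as_text$character)      # "character"
typeof(as_text$character)     # "character"
```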
Is the coercion of characters to factors reasonable default behavior? You'll be prepared to make your own judgement on that after the next lesson.