At the end of the last lesson, we learned that character data loaded into a data frame is converted into a special data structure called a factor by default. (This is the default in R versions before 4.0.0; from 4.0.0 onward, stringsAsFactors defaults to FALSE, so character columns stay as characters unless you ask for the conversion.) Factors are intended to hold categorical data, which is also called nominal data when the categories have no natural order. Categorical data describes variables that can take on one of several distinct values from a set. Examples of categorical variables include gender, state of residence and educational attainment.
Factors take categorical data and assign each category an integer value. The number of factor categories, or levels, is equal to the number of unique elements in the vector used to make the factor. For example, a factor built from a vector that contains only the values "male" and "female" would have two levels: male and female.
You can create a factor by passing a character or numeric vector into the factor() function:
In [1]:
gender_vector <- c(rep("male",10),rep("female",15)) # Create a character variable
print(gender_vector)
In [2]:
gender_factor <- factor(gender_vector) # Convert to factor
print(gender_factor)
In [3]:
numeric_vector <- round(runif(20,0,1)) # Create a numeric variable
numeric_factor <- factor(numeric_vector) # Convert to factor
print(numeric_factor)
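To see the integer codes that sit underneath a factor, it can help to look at a small example. The sketch below uses a made-up vector called sizes (it is not part of the lesson's data) and shows the levels next to the integer code stored for each element:

```r
# Hypothetical example vector, used only for illustration
sizes <- c("small", "large", "medium", "small", "large")
size_factor <- factor(sizes)

size_levels <- levels(size_factor)     # Levels are sorted alphabetically by default
size_codes <- as.integer(size_factor)  # Integer code = position of the value in levels

print(size_levels)          # "large" "medium" "small"
print(size_codes)           # 3 1 2 3 1
print(typeof(size_factor))  # "integer": the underlying storage type
```

Each code is simply an index into the levels vector, which is why "small" is stored as 3 here.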
You can specify the levels a factor can take by passing a character vector of levels to the levels argument:
In [4]:
gender_factor <- factor(gender_vector, levels = c("male","female","other"))
print(gender_factor)
In this case there are no data points that take on the level "other" but the factor allows for the possibility of encountering the category "other".
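A quick way to confirm that an empty level is really part of the factor is to count the values per level. The sketch below rebuilds the same gender factor so it can run on its own, and also shows what happens to a value that is not listed in levels (the value "nonbinary" is a made-up example):

```r
gender_vector <- c(rep("male", 10), rep("female", 15))
gender_factor <- factor(gender_vector, levels = c("male", "female", "other"))

counts <- table(gender_factor)  # Unused levels show up with a count of 0
print(counts)
print(nlevels(gender_factor))   # 3

# A value that is not in the levels argument becomes NA
with_unlisted <- factor(c("male", "nonbinary"),
                        levels = c("male", "female", "other"))
print(is.na(with_unlisted))     # FALSE TRUE
```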
You can check, rename and add to the levels of a factor with the levels() function:
In [5]:
levels(gender_factor) # Check levels
Out[5]:
In [6]:
levels(gender_factor) <- c("male","female","unknown") # Change levels
levels(gender_factor)
Out[6]:
In [7]:
levels(gender_factor) <- c("male","female","unknown","no_response") # Add a level
levels(gender_factor)
Out[7]:
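One thing to watch for: assigning to levels() relabels the existing levels by position, so it is not a safe way to reorder them. The sketch below uses a made-up factor called answers to contrast relabeling with reordering through factor():

```r
# Hypothetical example factor, used only for illustration
answers <- factor(c("no", "yes", "yes", "no", "yes"))
print(levels(answers))            # "no" "yes"

# Assigning to levels() renames by position, so the data itself changes meaning
relabeled <- answers
levels(relabeled) <- c("yes", "no")
print(as.character(relabeled))    # "yes" "no" "no" "yes" "no"

# Passing levels to factor() changes the level order but keeps every value intact
reordered <- factor(answers, levels = c("yes", "no"))
print(as.character(reordered))    # "no" "yes" "yes" "no" "yes"
print(levels(reordered))          # "yes" "no"
```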
You can remove factor levels with no data present by recreating the factor with the factor() function or by using the droplevels() function:
In [8]:
gender_factor <- factor(gender_factor) # Recreating a factor drops unused levels
levels(gender_factor)
Out[8]:
In [9]:
gender_factor <- droplevels(gender_factor) # droplevels also removes unused levels
levels(gender_factor)
Out[9]:
R offers a second type of factor called an ordered factor for ordinal data. Ordinal data is non-numeric data that has some sense of natural ordering. For example, a variable with the levels "very low", "low", "medium", "high", and "very high" is not numeric but it has a natural ordering, so it can be encoded as an ordered factor. To create an ordered factor, use the factor() function with the additional argument ordered=TRUE or use the ordered() function:
In [10]:
dat <- rep(c("very low", "low", "medium", "high", "very high"), 5)
dat_factor <- factor(dat,
                     levels = c("very low", "low", "medium", "high", "very high"),
                     ordered = TRUE)
print(dat_factor)
In [11]:
dat_factor <- ordered(dat,
                      levels = c("very low", "low", "medium", "high", "very high"))
print(dat_factor)
*Note: it is important to use the levels argument when creating an ordered factor because the levels you supply are used to create the ordering from lowest to highest. If you leave it out, R sorts the levels alphabetically, which rarely matches the intended order.
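Once the ordering is in place, comparison operators and functions like max() and sort() respect it. The sketch below uses a made-up ratings vector (not the lesson's dat) and also shows the alphabetical ordering you get when levels is omitted:

```r
# Hypothetical example data, used only for illustration
ratings <- c("low", "high", "medium", "very low", "very high", "medium")
rating_levels <- c("very low", "low", "medium", "high", "very high")
rating_factor <- factor(ratings, levels = rating_levels, ordered = TRUE)

above_medium <- rating_factor > "medium"  # Comparisons use the level order
print(above_medium)                       # FALSE TRUE FALSE FALSE TRUE FALSE
print(max(rating_factor))                 # very high
print(sort(rating_factor))                # Sorted from "very low" up to "very high"

# Without the levels argument the ordering is alphabetical
default_levels <- levels(factor(ratings, ordered = TRUE))
print(default_levels)  # "high" "low" "medium" "very high" "very low"
```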
Factor Indexing
Since factors are essentially vectors in which each value is an integer code paired with a character level, factor indexing works the same as vector indexing:
In [12]:
gender_factor[2] # Get the second element
gender_factor[9:15] # Get a slice of elements
gender_factor[c(3,6,12)] # Get a selection of specific elements
gender_factor[gender_factor=="male"] # Get all values where the level equals male
Out[12]:
Out[12]:
Out[12]:
Out[12]:
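Keep in mind that indexing a factor keeps all of its levels, even ones that no longer appear in the result. The sketch below rebuilds a two-level gender factor so it can run on its own, then shows two ways to discard the leftover level:

```r
gender_factor <- factor(c(rep("male", 10), rep("female", 15)))

males <- gender_factor[gender_factor == "male"]
print(levels(males))  # "female" "male": the unused level is still there
print(table(males))   # female 0, male 10

# Either droplevels() or drop = TRUE inside the brackets removes unused levels
males_dropped <- droplevels(males)
males_dropped2 <- gender_factor[gender_factor == "male", drop = TRUE]
print(levels(males_dropped))   # "male"
print(levels(males_dropped2))  # "male"
```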
Factor Summary Functions
In addition to levels(), factors support several other summary functions:
In [13]:
summary(gender_factor) # summary() returns counts for each level
Out[13]:
In [14]:
str(gender_factor) # str() shows the factor's structure
In [15]:
length(gender_factor) # Get the length of the factor
Out[15]:
In [16]:
table(gender_factor) # table() creates a frequency table of counts per level
Out[16]:
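Two more helpers that often come up alongside these are nlevels() and prop.table(). As a small sketch, again rebuilding the gender factor so the code runs on its own, you can turn the counts from table() into proportions:

```r
gender_factor <- factor(c(rep("male", 10), rep("female", 15)))

counts <- table(gender_factor)      # Counts per level
proportions <- prop.table(counts)   # Share of each level; the shares sum to 1

print(nlevels(gender_factor))       # 2
print(proportions)                  # female 0.6, male 0.4
```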
Factors and ordered factors are useful because many statistical, predictive modeling and graphing functions in R are set up to recognize factors and handle them as categorical variables. When you are performing data analysis, you'll probably want to encode your character data as factors more often than not. On the other hand, factors are harder to manipulate than plain character or numeric vectors, so it is best to work with those simpler types while you are doing data munging.
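A common example of why factors can be awkward to manipulate is converting a factor that holds numbers back into a numeric vector. The sketch below uses a made-up year_factor; calling as.numeric() directly returns the integer codes, not the original values:

```r
# Hypothetical example factor, used only for illustration
year_factor <- factor(c("2010", "2005", "2010", "1999"))

codes <- as.numeric(year_factor)                # Integer codes, not the years
years <- as.numeric(as.character(year_factor))  # Convert to character first

print(codes)  # 3 2 3 1
print(years)  # 2010 2005 2010 1999
```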
When loading well-formed data into an R data frame, it can be convenient to have characters converted to factors. If you're loading messy data or data with an unknown structure, you may want to keep text data in the character format and then convert columns to factors later.
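As a minimal sketch of the second approach, the code below builds a small made-up data frame called survey with stringsAsFactors = FALSE so the text columns stay as characters, then converts only the column that is truly categorical:

```r
# Hypothetical data frame, used only for illustration
survey <- data.frame(
  respondent = c("r1", "r2", "r3", "r4"),
  state = c("TX", "CA", "TX", "NY"),
  stringsAsFactors = FALSE  # Keep text columns as characters
)
str(survey)

# Convert a single column to a factor once you know it is categorical
survey$state <- factor(survey$state)
str(survey)
print(levels(survey$state))  # "CA" "NY" "TX"
```

Here respondent stays a character column because it is an identifier, not a category.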
Now that we know about all of R's basic data types and data structures we are ready to learn how to load data into R from external sources and write data to files.