Tuesday, July 28, 2015

Introduction to R Part 9: Factors


At the end of the last lesson, we learned that character data loaded into a data frame is converted into a special data structure called a factor by default (in versions of R before 4.0.0; newer versions keep character columns unless you set stringsAsFactors = TRUE). Factors are intended to hold categorical (also called nominal) data. Categorical data describes variables that can take on one of several distinct values from a set. Examples of categorical variables include gender, state of residence and educational attainment.
Factors take categorical data and assign each category an integer value. By default, the number of factor categories, or levels, is equal to the number of unique elements in the vector used to make the factor. For example, a factor representing gender would have two levels: male and female.
You can create a factor by passing a character or numeric vector into the factor() function:
In [1]:
gender_vector <- c(rep("male",10),rep("female",15)) # Create a character variable

print(gender_vector)
 [1] "male"   "male"   "male"   "male"   "male"   "male"   "male"   "male"  
 [9] "male"   "male"   "female" "female" "female" "female" "female" "female"
[17] "female" "female" "female" "female" "female" "female" "female" "female"
[25] "female"
In [2]:
gender_factor <- factor(gender_vector)              # Convert to factor
 
print(gender_factor)
 [1] male   male   male   male   male   male   male   male   male   male  
[11] female female female female female female female female female female
[21] female female female female female
Levels: female male
In [3]:
numeric_vector <- round(runif(20,0,1))             # Create a numeric variable


numeric_factor <-  factor(numeric_vector)          # Convert to factor

print( numeric_factor )
 [1] 0 1 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 1 1
Levels: 0 1
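To make the "integer value per category" idea concrete, here is a small sketch that peeks at the integer codes behind a factor. The vectors sizes and num_factor are made-up illustration data, not part of the lesson. The last two lines show a common pitfall: converting a factor built from numbers with as.integer() returns the codes, not the original values.

```r
# Hypothetical example data, not taken from the lesson
sizes <- factor(c("small", "large", "medium", "small"))

levels(sizes)        # levels are sorted alphabetically by default
as.integer(sizes)    # integer codes: each value's position in levels(sizes)

# A factor built from numbers stores codes, not the original values
num_factor <- factor(c(10, 20, 20, 30))
as.integer(num_factor)                  # codes 1 2 2 3, not the data
as.numeric(as.character(num_factor))    # recovers 10 20 20 30
```

Going through as.character() first is the usual way to get the original numbers back.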
You can specify the levels a factor can take by passing a character vector of levels to the levels argument:
In [4]:
gender_factor <- factor(gender_vector, levels = c("male","female","other"))

print(gender_factor)
 [1] male   male   male   male   male   male   male   male   male   male  
[11] female female female female female female female female female female
[21] female female female female female
Levels: male female other
In this case there are no data points that take on the level "other" but the factor allows for the possibility of encountering the category "other".
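As a rough illustration of what declaring levels up front does, the sketch below uses a made-up vector called answers. An unused level still shows up in table() with a count of zero, while a value that is not among the declared levels is silently turned into NA.

```r
answers <- c("male", "female", "female", "nonbinary")   # hypothetical data

answers_factor <- factor(answers, levels = c("male", "female", "other"))

table(answers_factor)    # "other" is listed with a count of 0
is.na(answers_factor)    # "nonbinary" is not a declared level, so it becomes NA
```

Because the conversion to NA happens without a warning, it is worth checking for unexpected NA values after setting levels by hand.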
You can check, rename and add to the levels of a factor with the levels() function:
In [5]:
levels(gender_factor)                                      # Check levels
Out[5]:
  1. "male"
  2. "female"
  3. "other"
In [6]:
levels(gender_factor) <- c("male","female","unknown")      # Change levels

levels(gender_factor)
Out[6]:
  1. "male"
  2. "female"
  3. "unknown"
In [7]:
levels(gender_factor) <- c("male","female","unknown","no_response") # Add a level

levels(gender_factor)
Out[7]:
  1. "male"
  2. "female"
  3. "unknown"
  4. "no_response"
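One detail worth seeing in action: assigning to levels() renames levels by position, so the order of the new names matters. The sketch below uses a made-up factor f. It also shows one way to append a level without retyping the existing ones, after which that value can be assigned to an element.

```r
f <- factor(c("m", "f", "f", "m"), levels = c("m", "f"))   # hypothetical data

levels(f) <- c("male", "female")      # renames by position: "m" -> "male", "f" -> "female"
print(f)

levels(f) <- c(levels(f), "unknown")  # append a level, keeping the old ones
f[2] <- "unknown"                     # allowed now that "unknown" is a level
print(f)
```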
You can remove factor levels with no data present by recreating the factor with the factor() function or by using the droplevels() function:
In [8]:
gender_factor <- factor(gender_factor)  # Recreating a factor drops unused levels

levels(gender_factor)
Out[8]:
  1. "male"
  2. "female"
In [9]:
gender_factor <- droplevels(gender_factor) # droplevels also removes unused levels

levels(gender_factor)
Out[9]:
  1. "male"
  2. "female"
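Unused levels most often appear after subsetting, because a subset keeps every level of the original factor. Here is a minimal sketch with a made-up factor called colors, showing droplevels() and the drop = TRUE argument of factor subsetting as two ways to clean up.

```r
colors <- factor(c("red", "green", "blue", "red"))   # hypothetical data

reds <- colors[colors == "red"]
levels(reds)                                   # still "blue" "green" "red"

levels(droplevels(reds))                       # only "red"
levels(colors[colors == "red", drop = TRUE])   # drops unused levels while subsetting
```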
R offers a second type of factor called an ordered factor for ordinal data. Ordinal data is non-numeric data that has some sense of natural ordering. For example, a variable with the levels "very low", "low", "medium", "high", and "very high" is not numeric but it has a natural ordering, so it can be encoded as an ordered factor. To create an ordered factor, use the factor() function with the additional argument ordered=TRUE or use the ordered() function:
In [10]:
dat <- rep(c("very low", "low", "medium", "high", "very high"), 5)

dat_factor <- factor(dat, 
                     levels=c("very low", "low", "medium", "high", "very high"),
                     ordered=TRUE)

print(dat_factor)
 [1] very low  low       medium    high      very high very low  low      
 [8] medium    high      very high very low  low       medium    high     
[15] very high very low  low       medium    high      very high very low 
[22] low       medium    high      very high
Levels: very low < low < medium < high < very high
In [11]:
dat_factor <- ordered(dat, 
                     levels=c("very low", "low", "medium", "high", "very high"))

print(dat_factor)
 [1] very low  low       medium    high      very high very low  low      
 [8] medium    high      very high very low  low       medium    high     
[15] very high very low  low       medium    high      very high very low 
[22] low       medium    high      very high
Levels: very low < low < medium < high < very high
*Note: it is important to use the levels argument when creating an ordered factor because the levels you supply are used to create the ordering from lowest to highest.
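To see why the levels argument matters, the sketch below builds an ordered factor from a made-up vector called ratings, first without levels and then with them. It also shows that ordered factors support comparisons such as < and summary functions such as max().

```r
ratings <- c("low", "high", "medium", "low")   # hypothetical data

# Without levels, the ordering falls back to alphabetical: high < low < medium
levels(ordered(ratings))

ratings_ord <- ordered(ratings, levels = c("low", "medium", "high"))

ratings_ord < "high"     # element-wise comparison against a level
max(ratings_ord)         # highest level present in the data
```

The same comparison on an unordered factor would return NA with a warning, since its levels have no defined order.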

Factor Indexing

Since factors are essentially vectors in which each value pairs an integer code with a character level, factor indexing works the same as vector indexing:
In [12]:
gender_factor[2]                      # Get the second element
gender_factor[9:15]                   # Get a slice of elements
gender_factor[c(3,6,12)]              # Get a selection of specific elements
gender_factor[gender_factor=="male"]  # Get all values where the level equals male
Out[12]:
male
Out[12]:
  1. male
  2. male
  3. female
  4. female
  5. female
  6. female
  7. female
Out[12]:
  1. male
  2. male
  3. female
Out[12]:
  1. male
  2. male
  3. male
  4. male
  5. male
  6. male
  7. male
  8. male
  9. male
  10. male
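A few more vector-style indexing idioms carry over to factors as well. The sketch below uses a made-up factor called pets and shows negative indexing, matching several levels with %in%, and finding positions with which().

```r
pets <- factor(c("cat", "dog", "bird", "dog", "cat"))   # hypothetical data

pets[-1]                            # everything except the first element
pets[pets %in% c("cat", "bird")]    # match several levels at once
which(pets == "dog")                # positions of matching elements
```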

Factor Summary Functions

In addition to levels(), factors support several other summary functions:
In [13]:
summary(gender_factor)       # summary() returns counts for each level
Out[13]:
  male female 
    10     15 
In [14]:
str(gender_factor)           # str() shows the factor's structure
 Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...
In [15]:
length(gender_factor)        # Get the length of the factor
Out[15]:
25
In [16]:
table(gender_factor)         # table() creates a frequency table for the factor
Out[16]:
gender_factor
  male female 
    10     15 
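A couple of related helpers can round out these summaries. The sketch below, using a made-up factor called blood, counts the levels with nlevels(), turns counts into proportions with prop.table(), and picks out the most frequent level.

```r
blood <- factor(c("A", "O", "O", "B", "O", "A"))   # hypothetical data

nlevels(blood)               # number of levels
counts <- table(blood)
prop.table(counts)           # share of observations in each level
names(which.max(counts))     # most frequent level
```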
Factors and ordered factors are useful because many statistical, predictive modeling and graphing functions in R are set up to recognize factors and handle them as categorical variables. When you are performing data analysis, you'll probably want to encode your character data as factors more often than not. On the other hand, factors are harder to manipulate than plain character or numeric vectors, so it is usually best to work with ordinary vectors while you are doing data munging.
When loading well-formed data into an R data frame, it can be convenient to have characters converted to factors. If you're loading messy data or data with an unknown structure, you may want to keep text data in the character format and then convert columns to factors later.
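Here is a minimal sketch of the "load as character, convert later" workflow. It reads a small made-up CSV from an inline string (csv_text and its columns are illustration only) and sets stringsAsFactors explicitly, since the default differs between R versions: TRUE before R 4.0.0, FALSE from 4.0.0 onward.

```r
csv_text <- "name,dept\nAnn,sales\nBob,ops\nCy,sales"   # hypothetical inline data

# Keep text as character while inspecting and cleaning
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df$dept)               # "character"

# Convert a column once its values are known to be clean
df$dept <- factor(df$dept)
levels(df$dept)              # "ops" "sales"
```

Setting the argument explicitly keeps a script's behavior the same regardless of which R version runs it.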
Now that we know about all of R's basic data types and data structures we are ready to learn how to load data into R from external sources and write data to files.
