At the end of the last lesson, we learned that character data loaded into a data frame is converted into a special data structure called a factor by default. (This applies to versions of R before 4.0.0; from 4.0.0 onward, character data stays as character unless you set stringsAsFactors = TRUE.) Factors are intended to hold categorical (also called nominal) data. Categorical data describes variables that can take on one of several distinct values from a set. Examples of categorical variables include gender, state of residence and educational attainment.

Factors take categorical data and assign each category an integer value. The number of factor categories, or levels, is equal to the number of unique elements in the vector used to make the factor. For example, a factor representing gender would have two levels: male and female.

You can create a factor by passing a character or numeric vector into the factor() function:

In [1]:

gender_vector <- c(rep("male",10),rep("female",15)) # Create a character variable

print(gender_vector)

 [1] "male"   "male"   "male"   "male"   "male"   "male"   "male"   "male"  
 [9] "male"   "male"   "female" "female" "female" "female" "female" "female"
[17] "female" "female" "female" "female" "female" "female" "female" "female"
[25] "female"

In [2]:

gender_factor <- factor(gender_vector)              # Convert to factor
 
print(gender_factor)

 [1] male   male   male   male   male   male   male   male   male   male  
[11] female female female female female female female female female female
[21] female female female female female
Levels: female male

In [3]:

numeric_vector <- round(runif(20,0,1))             # Create a numeric variable


numeric_factor <-  factor(numeric_vector)          # Convert to factor

print( numeric_factor )

 [1] 0 1 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 1 1
Levels: 0 1
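To see the integer values that sit behind a factor, you can convert it with as.integer(). The sketch below uses made-up variable names (they are not part of the lesson's data) and also shows a common pitfall: calling as.numeric() on a factor built from numbers returns the integer codes rather than the original values.

```r
# Illustrative example; variable names are made up for this sketch
color_factor <- factor(c("red", "green", "red", "blue"))

levels(color_factor)       # Levels are sorted: "blue" "green" "red"
as.integer(color_factor)   # Integer codes behind each value: 3 2 3 1

# A factor built from numbers stores codes too, not the numbers themselves
score_factor <- factor(c(10, 20, 20, 30))
as.numeric(score_factor)                  # Codes: 1 2 2 3
as.numeric(as.character(score_factor))    # Original values: 10 20 20 30
```

Going through as.character() first is a simple way to recover the original numbers from a numeric factor.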

You can specify the levels a factor can take by passing a character vector of levels to the levels argument:

In [4]:

gender_factor <- factor(gender_vector, levels = c("male","female","other"))

print(gender_factor)

 [1] male   male   male   male   male   male   male   male   male   male  
[11] female female female female female female female female female female
[21] female female female female female
Levels: male female other

In this case there are no data points that take on the level "other" but the factor allows for the possibility of encountering the category "other".
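The reverse situation is worth knowing about as well: a value in the data that is not listed in the levels argument becomes NA. The following is a minimal sketch with hypothetical survey responses:

```r
# Hypothetical survey responses; "n/a" is not among the allowed levels
responses <- c("male", "female", "n/a", "female")
response_factor <- factor(responses, levels = c("male", "female", "other"))

print(response_factor)          # The "n/a" entry prints as <NA>
is.na(response_factor)          # FALSE FALSE TRUE FALSE
table(response_factor)          # Counts: male 1, female 2, other 0
```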

You can check, rename and add to the levels of a factor with the levels() function:

In [5]:

levels(gender_factor)                                      # Check levels

Out[5]:

"male"
"female"
"other"

In [6]:

levels(gender_factor) <- c("male","female","unknown")      # Change levels

levels(gender_factor)

Out[6]:

"male"
"female"
"unknown"

In [7]:

levels(gender_factor) <- c("male","female","unknown","no_response") # Add a level

levels(gender_factor)

Out[7]:

"male"
"female"
"unknown"
"no_response"

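Keep in mind that assigning to levels() renames levels by position, not by name, so the new names must be in the same order as the existing levels. A small sketch with an illustrative factor:

```r
# Illustrative factor with levels in the order "y", "n"
answer_factor <- factor(c("y", "n", "y"), levels = c("y", "n"))

# Replacement is positional: first name replaces "y", second replaces "n"
levels(answer_factor) <- c("yes", "no")
as.character(answer_factor)     # "yes" "no" "yes"

# A single level can be renamed by indexing into levels()
levels(answer_factor)[2] <- "nope"
levels(answer_factor)           # "yes" "nope"
```
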
You can remove factor levels with no data present by recreating the factor with the factor() function or by using the droplevels() function:

In [8]:

gender_factor <- factor(gender_factor)  # Recreating a factor drops unused levels

levels(gender_factor)

Out[8]:

"male"
"female"

In [9]:

gender_factor <- droplevels(gender_factor) # droplevels also removes unused levels

levels(gender_factor)

Out[9]:

"male"
"female"

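Unused levels often appear after subsetting, because a subset keeps all of the original factor's levels. The sketch below, using an illustrative factor, shows the leftover level and how droplevels() clears it:

```r
# Illustrative factor; subsetting removes every "large" value
size_factor <- factor(c("small", "large", "medium", "small"))
not_large <- size_factor[size_factor != "large"]

levels(not_large)               # Still "large" "medium" "small"
table(not_large)                # "large" shows a count of 0

not_large <- droplevels(not_large)
levels(not_large)               # "medium" "small"
```
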
R offers a second type of factor called an ordered factor for ordinal data. Ordinal data is non-numeric data that has some sense of natural ordering. For example, a variable with the levels "very low", "low", "medium", "high", and "very high" is not numeric but it has a natural ordering, so it can be encoded as an ordered factor. To create an ordered factor, use the factor() function with the additional argument ordered=TRUE or use the ordered() function:

In [10]:

dat <- rep(c("very low", "low", "medium", "high", "very high"), 5)

dat_factor <- factor(dat, 
                     levels=c("very low", "low", "medium", "high", "very high"),
                     ordered=TRUE)

print(dat_factor)

 [1] very low  low       medium    high      very high very low  low      
 [8] medium    high      very high very low  low       medium    high     
[15] very high very low  low       medium    high      very high very low 
[22] low       medium    high      very high
Levels: very low < low < medium < high < very high

In [11]:

dat_factor <- ordered(dat, 
                     levels=c("very low", "low", "medium", "high", "very high"))

print(dat_factor)

 [1] very low  low       medium    high      very high very low  low      
 [8] medium    high      very high very low  low       medium    high     
[15] very high very low  low       medium    high      very high very low 
[22] low       medium    high      very high
Levels: very low < low < medium < high < very high

*Note: it is important to use the levels argument when creating an ordered factor because the levels you supply are used to create the ordering from lowest to highest.
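One practical benefit of an ordered factor is that comparison operators respect the level ordering. The sketch below uses an illustrative ratings vector and also shows what happens when the levels argument is left out: R falls back to sorted order, which is not the intended ordering here.

```r
# Illustrative ratings; names are made up for this sketch
ratings <- c("low", "high", "medium", "low")

rating_factor <- factor(ratings,
                        levels = c("low", "medium", "high"),
                        ordered = TRUE)

rating_factor >= "medium"       # FALSE TRUE TRUE FALSE
max(rating_factor)              # high

# Without levels, R falls back to sorted order, which is wrong here
default_factor <- factor(ratings, ordered = TRUE)
levels(default_factor)          # "high" "low" "medium"
```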

Factor Indexing

Since a factor is essentially a vector in which each value pairs an integer code with a character level, factor indexing works the same as vector indexing:

In [12]:

gender_factor[2]                      # Get the second element
gender_factor[9:15]                   # Get a slice of elements
gender_factor[c(3,6,12)]              # Get a selection of specific elements
gender_factor[gender_factor=="male"]  # Get all values where the level equals male

Out[12]:

male

Out[12]:

male
male
female
female
female
female
female

Out[12]:

male
male
female

Out[12]:

male
male
male
male
male
male
male
male
male
male
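Indexed assignment is where factors differ from ordinary vectors: you can only assign values that are already among the factor's levels. The following sketch uses an illustrative factor; suppressWarnings() is included only to keep the example quiet.

```r
# Illustrative factor with levels "cat" and "dog"
pet_factor <- factor(c("cat", "dog", "cat"))

pet_factor[2] <- "cat"          # Fine: "cat" is an existing level

# "bird" is not a level, so R warns and stores NA
# (suppressWarnings() only keeps this sketch quiet)
suppressWarnings(pet_factor[3] <- "bird")
is.na(pet_factor)               # FALSE FALSE TRUE

# Add the level first, then the assignment works
levels(pet_factor) <- c(levels(pet_factor), "bird")
pet_factor[3] <- "bird"
as.character(pet_factor)        # "cat" "cat" "bird"
```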

Factor Summary Functions

In addition to levels(), factors support several other summary functions:

In [13]:

summary(gender_factor)       # summary() returns counts for each level

Out[13]:

male: 10
female: 15

In [14]:

str(gender_factor)           # str() shows the factor's structure

 Factor w/ 2 levels "male","female": 1 1 1 1 1 1 1 1 1 1 ...

In [15]:

length(gender_factor)        # Get the length of the factor

Out[15]:

25

In [16]:

table(gender_factor)         # table() builds a table of counts for each level

Out[16]:

gender_factor
  male female 
    10     15
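Two related helpers are nlevels(), which counts the levels, and prop.table(), which turns a table of counts into proportions. The sketch below rebuilds a factor with the same 10/15 split under a new, illustrative name so that it runs on its own:

```r
# Illustrative factor with the same 10 male / 15 female split
group_factor <- factor(c(rep("male", 10), rep("female", 15)),
                       levels = c("male", "female"))

nlevels(group_factor)                 # Number of levels: 2
counts <- table(group_factor)         # Counts per level: male 10, female 15
prop.table(counts)                    # Proportions: male 0.4, female 0.6
```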

Factors and ordered factors are useful because many statistical, predictive modeling and graphing functions in R are set up to recognize factors and handle them as categorical variables. When you are performing data analysis, you'll probably want to encode your character data as factors more often than not. On the other hand, factors are harder to manipulate than plain character or numeric vectors, so it is best to work with ordinary vectors while you are cleaning and reshaping data.
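One way to follow that advice is a round trip: convert the factor to character, clean the text with ordinary string functions, then re-encode it. This is a minimal sketch with made-up city names, not a prescribed workflow:

```r
# Illustrative messy data: stray whitespace and inconsistent case
city_factor <- factor(c("boston ", "Boston", "denver"))
nlevels(city_factor)                      # 3 levels because of messy text

city_text <- as.character(city_factor)    # Back to a plain character vector
city_text <- tolower(trimws(city_text))   # Clean up with string functions

city_factor <- factor(city_text)          # Re-encode once the text is clean
levels(city_factor)                       # "boston" "denver"
```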

When loading well-formed data into an R data frame, it can be convenient to have characters converted to factors. If you're loading messy data or data with an unknown structure, you may want to keep text data in the character format and then convert columns to factors later.
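The load-then-convert approach might look like the sketch below. It reads a small inline CSV (the data and column names are invented for illustration) and passes stringsAsFactors explicitly, so the result does not depend on which default your version of R uses.

```r
# Invented inline data standing in for a file on disk
csv_text <- "name,grade\nAnn,high\nBob,low\nCy,high"

# Keep text as character on load (explicit, so it does not depend on defaults)
survey <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(survey$grade)                       # "character"

# Convert only the column that is truly categorical
survey$grade <- factor(survey$grade, levels = c("low", "high"))
class(survey$grade)                       # "factor"
class(survey$name)                        # "character"
```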

Now that we know about all of R's basic data types and data structures we are ready to learn how to load data into R from external sources and write data to files.

Life Is Study

Tuesday, July 28, 2015

Introduction to R Part 9: Factors


Next Time: Introduction to R Part 10: Reading and Writing Data
