Friday, September 11, 2015

Introduction to R Part 30: Random Forests


For the final lesson in this introduction to R series, we'll learn about random forest models. As we saw last time, decision trees are a conceptually simple predictive modeling technique, but when you start building deep trees, they become complicated and likely to overfit your training data. In addition, decision trees are constructed greedily: branch splits are always made on the variable that appears most significant at the time, even if those splits do not lead to the best tree overall. Random forests are an extension of decision trees that address these shortcomings.

Random Forest Basics

A random forest model is a collection of decision tree models that are combined together to make predictions. When you make a random forest, you have to specify the number of decision trees you want to use to make the model. The random forest algorithm then takes random samples of observations from your training data and builds a decision tree model for each sample. The random samples are typically drawn with replacement, meaning the same observation can be drawn multiple times. The end result is a bunch of decision trees that are created with different groups of data points drawn from the original training data.
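The bootstrap sampling described above is easy to illustrate in base R. This toy sketch (not part of the randomForest package, which handles sampling internally) draws three samples with replacement from a small training set of ten observations:

```r
set.seed(12)

train_rows <- 1:10        # Row indices of a toy training set with 10 observations

# Draw 3 bootstrap samples: each the same size as the data, drawn with replacement
boot_samples <- lapply(1:3, function(i) sample(train_rows, size = 10, replace = TRUE))

boot_samples[[1]]         # Some rows appear multiple times, others not at all
```

Each of the three samples would then be used to grow its own decision tree.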
The decision trees in a random forest model are a little different than the standard decision trees we made last time. Instead of growing trees where every single explanatory variable can potentially be used to make a branch at any level in the tree, random forests limit the variables that can be used to make a split in the decision tree to some random subset of the explanatory variables. Limiting the splits in this fashion helps avoid the pitfall of always splitting on the same variables and helps random forests create a wider variety of trees to reduce overfitting.
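The per-split variable restriction can also be sketched in base R. Using the Titanic predictors we'll work with later in this lesson, each branch split only gets to consider a fresh random subset of mtry variables:

```r
set.seed(12)

predictors <- c("Sex", "Pclass", "Age", "SibSp", "Fare", "Embarked")
mtry <- 2                            # Number of variables considered per split

# At each branch split, draw a fresh random subset of the predictors;
# only these variables compete to make this particular split
split_candidates <- sample(predictors, size = mtry)

split_candidates
```

Because a different subset is drawn at every split, strong variables can't dominate every tree, which is what produces the variety of trees described above.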
Random forests are an example of an ensemble model: a model composed of some combination of several different underlying models. Ensemble models often yield better results than single models because different models may detect different patterns in the data, and combining models tends to dull the tendency that complex single models have to overfit the data.
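To see how the combining step works for classification, here is a minimal base R sketch of majority voting: three toy "trees" make 0/1 predictions for five observations, and the forest predicts the most common class for each observation:

```r
# Made-up 0/1 predictions from three hypothetical trees for five observations
tree_preds <- rbind(tree1 = c(0, 1, 1, 0, 1),
                    tree2 = c(0, 1, 0, 0, 1),
                    tree3 = c(1, 1, 1, 0, 0))

# Majority vote down each column gives the ensemble's prediction
forest_preds <- apply(tree_preds, 2, function(votes) {
  as.numeric(names(which.max(table(votes))))
})

forest_preds    # 0 1 1 0 1
```

Note that the forest's predictions can disagree with any individual tree, which is exactly how the ensemble smooths out the quirks of single overfit trees.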
To build random forests in R, we'll need to install and load the randomForest package:
In [1]:
#install.packages("randomForest")   # Uncomment to install the random forest package
library(randomForest)
library(caret)
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.
Loading required package: lattice
Loading required package: ggplot2

Random Forests on the Titanic

Let's use the randomForest package's randomForest() function to build a predictive model with the Titanic training data. First we'll load and preprocess the data:
In [2]:
setwd("C:/Users/Greg/Desktop/Kaggle/titanic")      

titanic_train <- read.csv("titanic_train.csv")

titanic_train$Pclass <- ordered(titanic_train$Pclass,   # Convert to ordered factor
                                levels=c("3","2","1"))  

impute <- preProcess(titanic_train[,c(6:8,10)],        # Impute missing ages
                     method=c("knnImpute"))

titanic_train_imp <- predict(impute, titanic_train[,c(6:8,10)])     

titanic_train <- cbind(titanic_train[,-c(6:8,10)], titanic_train_imp)

titanic_train$Survived <- as.factor(titanic_train$Survived) # Convert target to factor
Next we'll build our random forest model:
In [3]:
set.seed(12)
rf_model <- randomForest(Survived ~ Sex + Pclass + Age + SibSp + Fare + Embarked,
                         data= titanic_train,   # Data set
                         ntree=1000,            # Number of trees to grow
                         mtry=2)                # Number of branch variables

rf_model               # View model summary
Out[3]:
Call:
 randomForest(formula = Survived ~ Sex + Pclass + Age + SibSp +      Fare + Embarked, data = titanic_train, ntree = 1000, mtry = 2) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 15.52%
Confusion matrix:
    0   1 class.error
0 511  38  0.06921676
1 100 240  0.29411765
The model summary output shows us the formula we used to build the model, the number of trees, and the number of variables tried at each branch split. The "OOB estimate of error rate" is an estimate of the model's performance based on the performance of each tree on "out of bag" data: the observations that were not included in the bootstrap sample used to create that tree. Checking OOB error is an alternative to assessing a random forest model with holdout validation or cross validation. In this case, the OOB error rate of 15.52% suggests the model is about 84.48% accurate.
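We can verify how the reported error rate follows from the OOB confusion matrix by hand, using the counts shown in the output above:

```r
# OOB confusion matrix counts copied from the model summary above
oob_confusion <- matrix(c(511,  38,
                          100, 240),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(actual = c("0", "1"),
                                        predicted = c("0", "1")))

# Error rate = misclassified observations / total observations
oob_error <- (oob_confusion["0", "1"] + oob_confusion["1", "0"]) / sum(oob_confusion)

round(oob_error, 4)    # 0.1552, matching the reported OOB estimate
```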
Let's use the random forest model to make predictions on the Titanic test set and submit them to Kaggle to see how it performs. We can use the same predict() function we used for decision trees to generate predictions:
In [4]:
# Load and prepare the test data
titanic_test <- read.csv("titanic_test.csv")

titanic_test$Pclass <- ordered(titanic_test$Pclass,     # Convert to ordered factor
                                levels=c("3","2","1"))  

# Impute missing test set ages using the previously constructed imputation model
titanic_test_imp <- predict(impute, titanic_test[,c(5:7,9)])

titanic_test <- cbind(titanic_test[,-c(5:7,9)], titanic_test_imp)
In [5]:
# Generate predictions and save them to a file for submission
test_preds <- predict(rf_model,              
                      newdata=titanic_test,      
                      type="class") 

prediction_sub <- data.frame(PassengerId=titanic_test$PassengerId, Survived=test_preds)

write.csv(prediction_sub, "tutorial_rf_submission.csv", row.names=FALSE)
If we submit these predictions to Kaggle, we achieve an accuracy of 0.78947 on the test data, which is a bit higher than any of our previous scores with decision trees or logistic regression.
Although random forests often have better predictive performance than decision trees, they aren't without their drawbacks. Training a random forest model can take much longer than training a single decision tree, because you have to build many trees instead of one. The final random forest model can also take up a lot of computer memory, depending on the size of the trees, the number of trees, and the size of the data you are using. It is easiest to start small and ramp up to larger forests with more trees.

Random Forests With the Caret Package

Since random forest models consist of a "bag" of decision trees, each built on a random sample of the data, we can estimate model performance with out of bag error. This means that in the case of random forests, holdout validation and cross validation aren't as necessary to get a sense of the model's ability to generalize to unseen data as they are for models that don't involve this sort of aggregation. Even so, we can use cross validation on a random forest model. Let's use the caret package's train() function to generate a random forest model with cross validation:
In [6]:
set.seed(12)

# Create a trainControl object
train_control <- trainControl(method = "repeatedcv",   # use cross validation
                              number = 10,             # Use 10 partitions
                              repeats = 2)             # Repeat the cross validation 2 times

# Set required model parameters
tune_grid = expand.grid(mtry=c(2))

# Train model
validated_rf <- train(Survived ~ Sex + Pclass + Age + SibSp + Fare + Embarked, 
                        data=titanic_train,                # Data set
                        method="rf",                       # Model type
                        trControl= train_control,          # model control options
                        tuneGrid = tune_grid,              # Required model parameters
                        ntree = 1000)                      # Additional parameters
                                                  

validated_rf          # View a summary of the model
Out[6]:
Random Forest 

889 samples
 11 predictor
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 

Summary of sample sizes: 800, 800, 801, 800, 800, 800, ... 

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.8284091  0.6155319  0.0411399    0.09767536

Tuning parameter 'mtry' was held constant at a value of 2
 
Even with this relatively small data set, 10-fold cross validation takes a little while to complete. When working with large data sets, cross validation may become impractically slow for random forest models; in those cases, using out of bag error or a holdout validation set is quicker and often sufficient. You can use out of bag error for validation when training a model with the caret package by changing the trainControl method to "oob".
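For example, a sketch of the same train() call assessed with out of bag error instead of repeated cross validation might look like this (assuming the caret package and the Titanic training data are loaded and preprocessed as above):

```r
# Use out of bag error instead of cross validation for model assessment
oob_control <- trainControl(method = "oob")

oob_rf <- train(Survived ~ Sex + Pclass + Age + SibSp + Fare + Embarked,
                data = titanic_train,            # Preprocessed training data from above
                method = "rf",                   # Model type
                trControl = oob_control,         # Assess with OOB error
                tuneGrid = expand.grid(mtry = c(2)),
                ntree = 1000)
```

Since no resampling partitions need to be built and refit, this runs in roughly the time of a single randomForest() call.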

Intro to R Conclusion

In this introduction to R lesson series, we built up slowly from the most basic rudiments of the R language to building predictive models that you can apply to real-world data. R is not the most beginner-friendly programming language in the world; my hope is that you found this to be an accessible and practical introduction to R. As a series focused on practical tools and geared toward beginners, we didn't always take the time to dig deep into the language or the statistical and predictive models we covered. My hope is that some of the lessons in this guide piqued your interest and equipped you with the tools you need to dig deeper on your own.
If you're interested in learning more about R, there are many ways to proceed. If you learn well with some structure, consider an online data science course that uses R, like the Analytics Edge from edX or one of the many data science offerings on Coursera or Udacity. If you like hands-on learning, try tackling some Kaggle competitions or finding a data set to analyze.
One of the hardest parts of learning a new skill is getting started. If any part of this guide helped you get started, it has served its purpose.



*Final Note: If you are interested in learning Python, I have a 30-part introduction to Python for data analysis that covers the same topics and recreates many of the same examples from this guide in Python.
