Tuesday, September 8, 2015

Introduction to R Part 28: Logistic Regression


In the last lesson we introduced linear regression as a predictive modeling method to estimate numeric variables. Now we turn our attention to classification: prediction tasks where the response variable is categorical instead of numeric. In this lesson, we will learn how to use a common classification technique known as logistic regression and apply it to the Titanic survival data we used in lesson 13.

Logistic Regression Basics

Logistic regression is a classification method built on the same concept as linear regression. With linear regression, we take a linear combination of explanatory variables plus an intercept term to arrive at a prediction. For example, last time, our simplest linear regression model was:

mileage = intercept + constant * CarWeight
Linear regression determines the constants that minimize the error this linear combination produces on the input data.
In classification problems, the response variable is categorical. The simplest case of classification is where the response variable is binary, meaning it can only take one of two values, such as true or false. Logistic regression takes a linear combination of explanatory variables plus an intercept term just like linear regression, but then it takes the result and passes it through a "logistic" function. The logistic or sigmoid function used for logistic regression is defined as:

S(t) = 1 / (1 + e^(-t))
where t is the same linear combination of variables used in linear regression. The sigmoid function looks like an elongated S when plotted:
In [1]:
library(ggplot2)
In [2]:
sigmoid <- function(t){ 1/(1+exp(-t)) }     # Define the sigmoid function

dummy_frame <- data.frame(x=c(-6,6))

ggplot(data=dummy_frame) +                  # Plot the function 
  stat_function(fun=sigmoid) +
  xlim(-6,6)


The sigmoid function is bounded below by 0 and bounded above by 1. In logistic regression, the output is interpreted as a probability: the probability that an observation belongs to the second of the two categories being modeled. When the linear combination of variables produces positive numbers, the resulting probability is greater than 0.5, and when it produces negative numbers, the probability is less than 0.5.
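For example, plugging a few values into the sigmoid function defined above shows how the sign of the linear combination translates into a probability on either side of 0.5:

sigmoid(0)     # A linear combination of 0 yields a probability of exactly 0.5
sigmoid(2)     # Positive values yield probabilities above 0.5 (roughly 0.88)
sigmoid(-2)    # Negative values yield probabilities below 0.5 (roughly 0.12)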
We won't go deeper into the details behind how logistic regression works, but instead focus on how to use it in R. The most important thing to know is that logistic regression outputs probabilities that we can use to classify observations.

Revisiting the Titanic

For the remainder of the lesson we'll be working with the Titanic survival training data from Kaggle that we saw in lesson 13. We'll start by loading the data and then carrying out a few of the same preprocessing tasks we performed in that lesson:
In [3]:
library(caret)
Loading required package: lattice
In [4]:
setwd("C:/Users/Greg/Desktop/Kaggle/titanic")      

titanic_train <- read.csv("titanic_train.csv")

titanic_train$PassengerId  <- NULL             # Remove PassengerId
titanic_train$Ticket  <- NULL                  # Remove Ticket
titanic_train$Name <- as.character(titanic_train$Name)    # Convert name to character

titanic_train$Pclass <- ordered(titanic_train$Pclass,     # Convert to ordered factor
                                levels=c("3","2","1"))  

# Reduce cabin factor levels
char_cabin <- as.character(titanic_train$Cabin)     

new_Cabin <- ifelse(char_cabin == "",          
                    "",                        
                    substr(char_cabin,1,1))    

new_Cabin <- factor(new_Cabin )                
titanic_train$Cabin <- new_Cabin
In [5]:
impute <- preProcess(titanic_train[,c(5:8)],  # Impute missing ages*
                     method=c("knnImpute"))

titanic_train_imp <- predict(impute, titanic_train[,c(5:8)])     

titanic_train <- cbind(titanic_train[,c(1:4)], titanic_train_imp, titanic_train[,c(9:10)])
In [6]:
summary(titanic_train)
Out[6]:
    Survived      Pclass      Name               Sex           Age          
 Min.   :0.0000   3:491   Length:889         female:312   Min.   :-2.01630  
 1st Qu.:0.0000   2:184   Class :character   male  :577   1st Qu.:-0.52730  
 Median :0.0000   1:214   Mode  :character                Median :-0.11330  
 Mean   :0.3825                                           Mean   :-0.01075  
 3rd Qu.:1.0000                                           3rd Qu.: 0.43869  
 Max.   :1.0000                                           Max.   : 3.47465  
                                                                            
     SibSp             Parch              Fare              Cabin     Embarked
 Min.   :-0.4749   Min.   :-0.4741   Min.   :-0.64584          :688   C:168   
 1st Qu.:-0.4749   1st Qu.:-0.4741   1st Qu.:-0.48696   C      : 59   Q: 77   
 Median :-0.4749   Median :-0.4741   Median :-0.35500   B      : 45   S:644   
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   D      : 33           
 3rd Qu.: 0.4311   3rd Qu.:-0.4741   3rd Qu.:-0.02207   E      : 32           
 Max.   : 6.7734   Max.   : 6.9631   Max.   : 9.66311   A      : 15           
                                                        (Other): 17           
*Note: we only use the numeric variables Age, SibSp, Parch and Fare for imputation because the impute function we learned only accepts numeric variables. If we wanted more accurate imputation, we could convert categorical variables like Sex and Pclass into several 0/1 numeric indicator variables (dummy variables). Converting a factor into 0/1 dummy variables is known as one-hot encoding. Also note that the preProcess() function scales and centers the data before imputation.
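As a brief aside, here is a minimal sketch of one-hot encoding a couple of the Titanic factors using the caret package's dummyVars() function. This is for illustration only; we won't use these indicator columns in this lesson:

# Create a one-hot encoder for the Sex and Embarked factors
dummies <- dummyVars(~ Sex + Embarked, data = titanic_train)

head(predict(dummies, newdata = titanic_train))   # Inspect the 0/1 indicator columns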
Now we are ready to use a logistic regression model to predict survival. You can make a logistic regression model by passing a formula to the glm() function. Let's make a logistic regression model that only uses the Sex variable as a predictor:
In [7]:
titanic_model <- glm(Survived ~ Sex,        # Formula
                    data= titanic_train,    # Data set
                    family="binomial")      # family="binomial" for binary logistic

summary(titanic_model)                      # Check model summary
Out[7]:
Call:
glm(formula = Survived ~ Sex, family = "binomial", data = titanic_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6423  -0.6471  -0.6471   0.7753   1.8256  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0480     0.1291   8.116 4.83e-16 ***
Sexmale      -2.5051     0.1673 -14.975  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1182.82  on 888  degrees of freedom
Residual deviance:  916.61  on 887  degrees of freedom
AIC: 920.61

Number of Fisher Scoring iterations: 4
The summary output for the logistic regression model coefficients looks similar to the output we saw for linear regression. We can see that the model produced a positive intercept value and a weight of -2.5051 on the gender level male, and that both the intercept and the male coefficient are highly significant. Note that the first level of each factor you pass into the model is treated as the baseline, so we don't need a separate variable for female. Let's use the model to make predictions on the training set:
In [8]:
train_preds <- predict(titanic_model,              # Model to use
                       newdata=titanic_train,      # Data to use for predictions
                       type="response")            # Return predicted probabilities

table(train_preds, titanic_train$Sex)
Out[8]:
                   
train_preds         female male
  0.1889081455806        0  577
  0.740384615384042    312    0
The table shows that the model predicted a survival chance of roughly 19% for males and 74% for females. If we used this simple model to predict survival, we'd end up predicting that all women survived and that all men died.
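As a quick sanity check, we can reproduce these two probabilities ourselves by passing the model's linear combination through the sigmoid function we defined earlier. Here we plug in the rounded coefficient estimates from the summary above, so the results match the table only approximately:

sigmoid(1.0480)             # Female: intercept only, roughly 0.74
sigmoid(1.0480 - 2.5051)    # Male: intercept plus the Sexmale weight, roughly 0.19

Let's make a more complicated model that includes a few more variables from the Titanic training set: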
In [9]:
titanic_model <- glm(Survived ~ Sex+Pclass+Age+SibSp+Cabin,
                    data= titanic_train,       
                    family="binomial")         

predicted_probabilities <- predict(titanic_model,              
                           newdata=titanic_train,      
                           type="response")            

summary(titanic_model)
Out[9]:
Call:
glm(formula = Survived ~ Sex + Pclass + Age + SibSp + Cabin, 
    family = "binomial", data = titanic_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6507  -0.5803  -0.4011   0.6128   2.4531  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.2259     0.1904   6.439 1.20e-10 ***
Sexmale      -2.7433     0.1990 -13.783  < 2e-16 ***
Pclass.L      1.2882     0.2859   4.506 6.60e-06 ***
Pclass.Q     -0.1867     0.2152  -0.867 0.385855    
Age          -0.6306     0.1191  -5.295 1.19e-07 ***
SibSp        -0.4518     0.1166  -3.873 0.000107 ***
CabinA        0.7809     0.6706   1.164 0.244245    
CabinB        0.8934     0.5508   1.622 0.104785    
CabinC        0.4144     0.4880   0.849 0.395847    
CabinD        1.2926     0.5655   2.286 0.022260 *  
CabinE        1.5976     0.5561   2.873 0.004067 ** 
CabinF        1.1126     0.7224   1.540 0.123520    
CabinG       -0.8989     1.0576  -0.850 0.395341    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1182.82  on 888  degrees of freedom
Residual deviance:  772.21  on 876  degrees of freedom
AIC: 798.21

Number of Fisher Scoring iterations: 5
We can convert the predicted probabilities to class predictions by assuming that any observation with a predicted probability of 0.5 or above is a positive result (in this case, survived) and that cases below 0.5 are negative for the class we are predicting:
In [10]:
class_preds <- ifelse(predicted_probabilities >= 0.5, 1, 0)  # Convert to 0, 1 preds

result_table <- table(class_preds,             # Make a table of predictions vs. actual
                      titanic_train$Survived)  

result_table
Out[10]:
           
class_preds   0   1
          0 471  92
          1  78 248
The table above shows the classes our model predicted vs. true values of the Survived variable. This table of predicted vs. actual values is known as a confusion matrix.

The Confusion Matrix

The confusion matrix is a common tool for assessing the results of classification. Each cell tells us something different about our predictions versus the true values. The bottom right cell indicates the true positives: people the model predicted to survive who actually did survive. The bottom left cell indicates the false positives: people the model predicted to survive who did not actually survive. The top left cell indicates the true negatives: people correctly identified as non-survivors. Finally, the top right cell shows the false negatives: passengers the model identified as non-survivors who actually did survive.
We can calculate the overall prediction accuracy from the matrix by adding the total number of correct predictions and dividing by the total number of predictions. In the case of our model, the prediction accuracy is:
In [11]:
(471+248)/sum(result_table)
Out[11]:
0.808773903262092
Overall prediction accuracy is just one of many quantities you can use to assess a classification model. Oftentimes accuracy is not the best metric for assessing a model.
Consider a model made to predict the occurrence of a disease that only occurs in 0.01% of people. A model that never predicts that anyone has the disease would be 99.99% accurate, but it also wouldn't help save lives. In this case, we might be more interested in the model's sensitivity: the proportion of positive cases that the model correctly identifies as positive.
Relying only on sensitivity can also be a problem. Consider a new model that predicts everyone has the disease. This new model would achieve a sensitivity score of 100%, since it would correctly label everyone who has the disease as having the disease. In this case, however, the model's specificity (the proportion of negative cases the model correctly identifies as negative) would be zero, so the model would lose any value for distinguishing between the classes.
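To make these definitions concrete, here is a sketch of computing sensitivity and specificity by hand from the confusion matrix we built earlier (recall that result_table has our predictions in the rows and the true values in the columns):

# Extract the four cells of the confusion matrix by name
true_positives  <- result_table["1","1"]   # Predicted survived, did survive
false_negatives <- result_table["0","1"]   # Predicted died, actually survived
true_negatives  <- result_table["0","0"]   # Predicted died, did die
false_positives <- result_table["1","0"]   # Predicted survived, actually died

true_positives/(true_positives+false_negatives)   # Sensitivity: 248/340, roughly 0.7294
true_negatives/(true_negatives+false_positives)   # Specificity: 471/549, roughly 0.8579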
We won't discuss all the different evaluation metrics that fall out of the confusion matrix, but it is a good idea to consider accuracy as well as sensitivity and specificity when assessing model performance. We can view these metrics, as well as several others, using the caret package's confusionMatrix() function. Let's run it on our Titanic predictions:
In [12]:
confusionMatrix(data= factor(class_preds),            # confusionMatrix() expects factors
                reference= factor(titanic_train$Survived),
                positive = "1")                       # Set the positive class to Survived
Out[12]:
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 471  92
         1  78 248
                                          
               Accuracy : 0.8088          
                 95% CI : (0.7814, 0.8341)
    No Information Rate : 0.6175          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.592           
 Mcnemar's Test P-Value : 0.3187          
                                          
            Sensitivity : 0.7294          
            Specificity : 0.8579          
         Pos Pred Value : 0.7607          
         Neg Pred Value : 0.8366          
             Prevalence : 0.3825          
         Detection Rate : 0.2790          
   Detection Prevalence : 0.3667          
      Balanced Accuracy : 0.7937          
                                          
       'Positive' Class : 1               
                                          
The confusion matrix summary output confirms our accuracy calculation and shows us several other evaluation metrics. Balanced accuracy is the average of sensitivity and specificity, which can give us a better sense of overall model performance than pure accuracy.
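As a quick arithmetic check using the rounded values above, averaging the two rates reproduces the reported balanced accuracy:

(0.7294 + 0.8579)/2    # Balanced accuracy: roughly 0.7937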
Accuracy is the scoring metric used to judge the Titanic competition, so we don't have to worry too much about other metrics here.
As a final exercise, let's use our logistic regression model to make predictions for the Titanic test set. First we need to load and prepare the test data using the same steps we used to prepare the training data:
In [13]:
titanic_test <- read.csv("titanic_test.csv")

ids <- titanic_test$PassengerId
titanic_test$PassengerId  <- NULL             
titanic_test$Ticket  <- NULL                  
titanic_test$Name <- as.character(titanic_test$Name)    

titanic_test$Pclass <- ordered(titanic_test$Pclass,     
                                levels=c("3","2","1"))  

char_cabin <- as.character(titanic_test$Cabin)     

new_Cabin <- ifelse(char_cabin == "",          
                    "",                        
                    substr(char_cabin,1,1))    

new_Cabin <- factor(new_Cabin )                
titanic_test$Cabin <- new_Cabin
In [14]:
# Impute missing test set ages using the previously constructed imputation model
titanic_test_imp <- predict(impute, titanic_test[,c(4:7)])

titanic_test <- cbind(titanic_test[,c(1:3)], titanic_test_imp, titanic_test[,c(8:9)])

summary(titanic_test)  # check test data
Out[14]:
 Pclass      Name               Sex           Age               SibSp        
 3:218   Length:418         female:152   Min.   :-2.03355   Min.   :-0.4749  
 2: 93   Class :character   male  :266   1st Qu.:-0.52730   1st Qu.:-0.4749  
 1:107   Mode  :character                Median :-0.11330   Median :-0.4749  
                                         Mean   : 0.02569   Mean   :-0.0696  
                                         3rd Qu.: 0.50769   3rd Qu.: 0.4311  
                                         Max.   : 3.19866   Max.   : 6.7734  
                                                                             
     Parch               Fare              Cabin     Embarked
 Min.   :-0.47406   Min.   :-0.64584          :327   C:102   
 1st Qu.:-0.47406   1st Qu.:-0.48696   C      : 35   Q: 46   
 Median :-0.47406   Median :-0.35500   B      : 18   S:270   
 Mean   : 0.01226   Mean   : 0.07005   D      : 13           
 3rd Qu.:-0.47406   3rd Qu.:-0.01257   E      :  9           
 Max.   :10.68167   Max.   : 9.66311   F      :  8           
                                       (Other):  8           
Now we can use our model to make predictions and save the result to a file:
In [15]:
test_preds <- predict(titanic_model,              
                      newdata=titanic_test,      
                      type="response") 

class_preds <- ifelse(test_preds >= 0.5, 1, 0)

prediction_sub <- data.frame(PassengerId=ids, Survived=class_preds)

write.csv(prediction_sub, "tutorial_logreg_submission.csv", row.names=FALSE)
It turns out that upon submission, this logistic regression model achieves an accuracy score of 0.76555, exactly the same accuracy as the simplistic "women survive, men die" model. It could be that the gender variable dominates our model so strongly that it always predicts that men die and women survive. We can investigate this with a table:
In [16]:
table(titanic_test$Sex, class_preds)
Out[16]:
        class_preds
           0   1
  female  11 141
  male   251  15
The model actually does predict that a few men survived and that a few women died, but it still makes no improvement over the basic gender-based model.

Wrap Up

Logistic regression is a common tool for generating class probabilities and predictions. Although logistic regression models are simple and often insufficient to fully capture the relationships between variables, they are a good starting point for many predictive modeling tasks because simple models tend not to overfit the data. Next time we will explore another predictive modeling technique for classification: decision trees.
