Thursday, September 3, 2015

Introduction to R Part 27: Linear Regression


In the last few lessons we learned about statistical inference techniques, including the t-test, chi-squared test and ANOVA, which let you analyze differences between data samples. Predictive modeling--using data samples to make predictions about unseen data, such as data that has yet to be generated--is another common data analytics task. Predictive modeling is a form of machine learning, which describes using computers to automate the process of finding patterns in data.
Machine learning is the driving force behind all kinds of modern conveniences and automation systems like ATMs that can read handwritten text, smartphones that translate speech to text and self-driving cars. The methods used in such cutting-edge applications are more advanced than anything we'll cover in this introduction, but they are all based on the principles of taking data and applying some learning algorithm to it to arrive at some sort of prediction.

This lesson is intended to provide a high level overview of linear regression and how to begin using it in R.

Linear Regression Basics

Linear regression is a predictive modeling technique for predicting a numeric response variable based on one or more explanatory variables. The term "regression" in predictive modeling generally refers to any modeling task that involves predicting a real number (as opposed to classification, which involves predicting a category or class). The term "linear" in the name linear regression refers to the fact that the method models data with a linear combination of the explanatory variables. A linear combination is an expression where one or more variables are scaled by a constant factor and added together. In the case of linear regression with a single explanatory variable, the linear combination can be expressed as:

response = intercept + constant * explanatory
The right side of the equation defines a line with a certain y-intercept plus a slope times the explanatory variable. In other words, linear regression in its most basic form fits a straight line to the response variable. The model is designed to fit a line that minimizes the squared differences between the predictions and the actual response values (these differences are also called errors or residuals). We won't go into all the math behind how the model actually minimizes the squared errors, but the end result is a line intended to give the "best fit" to the data. Since linear regression fits data with a line, it is most effective in cases where the response and explanatory variable have a linear relationship.
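To make "minimizing the squared errors" a bit more concrete, here is a small sketch (not part of the original lesson's code) that adds up the squared residuals for two arbitrary candidate lines on a toy data set; linear regression finds the intercept and slope that make this total as small as possible:

sum_squared_errors <- function(intercept, slope, x, y){
    predictions <- intercept + slope * x          # Predictions from the candidate line
    sum((y - predictions)^2)                      # Sum of squared residuals
}

x <- c(1, 2, 3, 4, 5)                             # Toy explanatory variable
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)                   # Toy response, roughly y = 2x

sum_squared_errors(intercept=0, slope=2, x, y)    # Error for a good candidate line
sum_squared_errors(intercept=0, slope=3, x, y)    # Error for a worse candidate line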
Let's revisit the mtcars data set and use linear regression to predict vehicle gas mileage based on vehicle weight. First, let's look at a scatterplot of weight and mpg to get a sense of the shape of the data:
In [1]:
library(ggplot2)
In [2]:
ggplot(data=mtcars, aes(x=wt, y=mpg)) +
    geom_point()


The scatterplot shows a roughly linear relationship between weight and mpg, suggesting a linear regression model might work well.
To fit a linear model in R, pass a formula specifying the model to the lm() function:
In [3]:
mpg_model <- lm(mpg ~ wt,       # Formula of the form response ~ explanatory
                data=mtcars)    # Data set to use

summary(mpg_model)              # View a summary of the regression model
Out[3]:
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
The output above shows the formula used to make the model, followed by a five number summary of the residuals and a summary of the model coefficients. The coefficients are the constants used to create the best fit line: in this case the y-intercept term is set to 37.2851 and the coefficient for the weight variable is -5.3445. In other words, the model fit the line mpg = 37.2851 - 5.3445*wt.
The t-value and Pr(>|t|) (p-value) columns indicate the results of conducting a t-test checking whether the coefficients chosen differ significantly from zero. In this case, there is near certainty that both the y-intercept and weight variable coefficients are not zero given the extremely low p-values, meaning both the intercept term and the weight are likely to be useful for predicting mpg.
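As an aside, if you'd like a range of plausible values for each coefficient rather than just a hypothesis test, R's built-in confint() function returns confidence intervals for a fitted model (a quick sketch):

confint(mpg_model, level=0.95)    # 95% confidence intervals for the model coefficients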
At the bottom of the summary output, the values of interest to us are "Multiple R-squared" and "Adjusted R-squared." Multiple R-squared is a value that describes the proportion of the variance in the response variable explained by the model. In this case, it basically tells how much of the variation in mpg can be explained by weight. Adjusted R-squared is a modification of the normal R-squared that adjusts for models that involve multiple explanatory variables. As you add more explanatory variables to a model, the normal R-squared reading can only increase, even if those variables add little information to the model. In general, we'd like to keep models as simple as possible, so adjusted R-squared is preferable for assessing the explanatory power of the model. In this case, roughly 75% of the variation in mpg is explained by weight.
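If you want to grab these values programmatically instead of reading them off the printout, the object returned by summary() stores them as components (a quick sketch):

model_summary <- summary(mpg_model)   # Store the summary object

model_summary$r.squared               # Multiple R-squared
model_summary$adj.r.squared           # Adjusted R-squared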
Let's plot the regression line on the scatterplot to get a sense of how well it fits the data:
In [4]:
ggplot(data=mtcars, aes(x=wt, y=mpg)) +
    geom_point() +
    geom_abline(intercept = 37.2851, slope = -5.3445, color="red")


The regression line looks like a reasonable fit and it follows our intuition: as car weight increases we would expect fuel economy to decline.
Outliers can have a large influence on linear regression models: since regression deals with minimizing squared residuals, large residuals have a disproportionately large influence on the model. Plotting the result helps us detect influential outliers. In this case there do not appear to be any influential outliers. Let's add an outlier--a super heavy, fuel-efficient car--and plot a new regression model:
In [5]:
super_car <- c(50,10)                                  # Outlier car: 50 mpg, weight 10 (10,000 lbs)
new_cars <- rbind(mtcars[,c("mpg","wt")], super_car)   # Add outlier to mtcars

ggplot(data=new_cars, aes(x=wt, y=mpg)) +
    geom_point() +
    geom_smooth(method=lm,      # Add a linear regression line
                se=FALSE,       # Don't add shaded confidence region
                color="red")


Although this is an extreme, contrived case, the plot above illustrates how much influence a single outlier can have on a linear regression model.
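To put a number on that influence rather than just eyeballing the plot, we could compare the fitted coefficients with and without the extra point (a quick sketch reusing the new_cars data frame created above):

coef(lm(mpg ~ wt, data=mtcars))      # Coefficients fit without the outlier
coef(lm(mpg ~ wt, data=new_cars))    # Coefficients fit with the outlier included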

Making Predictions

After creating a model, we'd like to be able to use it to make predictions. R contains a built-in function called predict() you can use to generate predictions based on a specified model for the original data set or for new data with the same variables as the original data. Let's use predict() to make predictions on the original data:
In [6]:
preds <- predict(mpg_model,    # Model to use for prediction
         newdata=mtcars)       # Data to use for prediction

preds                          # Check predictions
Out[6]:
Mazda RX4             23.2826106468086
Mazda RX4 Wag         21.9197703957643
Datsun 710            24.8859521186254
Hornet 4 Drive        20.1026500610386
Hornet Sportabout     18.900143957176
Valiant               18.7932545257216
Duster 360            18.2053626527221
Merc 240D             20.2362618503567
Merc 230              20.4500407132656
Merc 280              18.900143957176
Merc 280C             18.900143957176
Merc 450SE            15.5331268663607
Merc 450SL            17.3502472010864
Merc 450SLC           17.0830236224503
Cadillac Fleetwood     9.22665041054798
Lincoln Continental    8.29671235689423
Chrysler Imperial      8.71892561113932
Fiat 128              25.5272887073521
Honda Civic           28.6538045773949
Toyota Corolla        27.4780208313959
Toyota Corona         24.1110037405806
Dodge Challenger      18.4725862313582
AMC Javelin           18.9268663150396
Camaro Z28            16.7623553280869
Pontiac Firebird      16.7356329702233
Fiat X1-9             26.9435736741236
Porsche 914-2         25.8479570017155
Lotus Europa          29.1989406778126
Ford Pantera L        20.3431512818111
Ferrari Dino          22.4809399109002
Maserati Bora         18.2053626527221
Volvo 142E            22.427495195173
Note: In practice, we'd typically want to make predictions based on some new unseen data (test data) that did not play a part in creating the original model. We will see examples of making predictions on test data when we revisit the Titanic survival prediction competition in the next lesson.
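For instance, to predict the mileage of a car that isn't in mtcars at all, we could pass predict() a new data frame containing the explanatory variable (the weight below is just a made-up value):

new_car <- data.frame(wt=3.5)            # A hypothetical car weighing 3,500 lbs
predict(mpg_model, newdata=new_car)      # Predicted mpg for the hypothetical car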
We can check the differences between our predictions and the actual values by subtracting the predictions from the true values.
In [7]:
residuals <- mtcars$mpg-preds      # Check residuals
residuals
Out[7]:
Mazda RX4             -2.28261064680861
Mazda RX4 Wag         -0.919770395764324
Datsun 710            -2.08595211862541
Hornet 4 Drive         1.29734993896138
Hornet Sportabout     -0.200143957176017
Valiant               -0.69325452572156
Duster 360            -3.90536265272207
Merc 240D              4.16373814964332
Merc 230               2.34995928673441
Merc 280               0.299856042823983
Merc 280C             -1.10014395717602
Merc 450SE             0.86687313363927
Merc 450SL            -0.0502472010864388
Merc 450SLC           -1.88302362245031
Cadillac Fleetwood     1.17334958945202
Lincoln Continental    2.10328764310577
Chrysler Imperial      5.98107438886068
Fiat 128               6.87271129264787
Honda Civic            1.7461954226051
Toyota Corolla         6.42197916860409
Toyota Corona         -2.61100374058062
Dodge Challenger      -2.9725862313582
AMC Javelin           -3.72686631503963
Camaro Z28            -3.46235532808695
Pontiac Firebird       2.46436702977667
Fiat X1-9              0.35642632587636
Porsche 914-2          0.152042998284507
Lotus Europa           1.20105932218739
Ford Pantera L        -4.54315128181114
Ferrari Dino          -2.78093991090021
Maserati Bora         -3.20536265272207
Volvo 142E            -1.02749519517298
In a well-behaved linear regression model, we'd like the residuals to be roughly normally distributed. That is, we'd like a roughly even spread of error above and below the regression line. We can investigate the normality of residuals with a Q-Q (quantile-quantile) plot:
In [8]:
qqnorm(residuals)       # Make a QQ plot of residuals
qqline(residuals)


When residuals are normally distributed, they tend to lie along the straight line on the Q-Q plot. In this case the residuals appear to follow a slightly non-linear pattern: they are bowed a bit away from the normality line on each end. This is an indication that a simple straight line might not be sufficient to fully describe the relationship between weight and mpg.
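Another common diagnostic, not covered above but built into R, is a plot of the residuals against the fitted values; a pronounced curve or funnel shape in that plot is another hint that a straight line is missing some structure:

plot(mpg_model, which=1)     # Residuals vs. fitted values diagnostic plot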
After making model predictions, it is useful to have some sort of metric to evaluate how well the model performed. Adjusted R-squared is one useful measure, but it only applies to the regression model itself: we'd like some universal evaluation metric that lets us compare the performance of different types of models. Root mean squared error (RMSE) is a common evaluation metric for predictions involving real numbers. Root mean squared error is the square root of the average of the squared errors (the squared residuals). If you recall, we wrote a function to calculate RMSE in lesson 11:
In [9]:
root_mean_squared_error <- function(predicted, targets){  
    # Computes root mean squared error between two vectors
    #
    # Args:
    #    predicted: a numeric vector of predictions
    #    targets: a numeric vector of target values for each prediction
    #
    # Returns:
    #    The root mean squared error between predicted values and targets
    
    sqrt(mean((targets-predicted)^2))                 
}

root_mean_squared_error(preds, mtcars$mpg)           # compute RMSE
Out[9]:
2.94916268595503
We could also use a pre-made RMSE function, such as the one provided in the caret package:
In [10]:
library(caret)
Loading required package: lattice
In [11]:
RMSE(preds, mtcars$mpg)
Out[11]:
2.94916268595503

Polynomial Regression

Variables often exhibit non-linear relationships that can't be fit well with a straight line. In these cases, we can use linear regression to fit a curved line to the data by adding extra higher order terms (squared, cubic, etc.) to the model formula. A linear regression that involves higher order terms is known as "polynomial regression." You can include higher order terms by adding extra explanatory terms wrapped in the I() function:
In [12]:
quadratic_model <- lm(mpg ~ wt + I(wt^2),  # Add the weight squared term to the model
                      data=mtcars) 

summary(quadratic_model)                   # Check the model result
Out[12]:
Call:
lm(formula = mpg ~ wt + I(wt^2), data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.483 -1.998 -0.773  1.462  6.238 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  49.9308     4.2113  11.856 1.21e-12 ***
wt          -13.3803     2.5140  -5.322 1.04e-05 ***
I(wt^2)       1.1711     0.3594   3.258  0.00286 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.651 on 29 degrees of freedom
Multiple R-squared:  0.8191, Adjusted R-squared:  0.8066 
F-statistic: 65.64 on 2 and 29 DF,  p-value: 1.715e-11
The summary output shows us that including the weight squared term appears to improve the model's performance: the adjusted R-squared increased to 0.8066 and the 2nd order polynomial term has a p-value of 0.00286, indicating that it is highly significant. Let's plot the curved line defined by the new model to see if the fit looks better than the old one:
In [13]:
model_function <- function(x){              # Create a function based on the model
    49.9308 + -13.3803*x + 1.1711*x^2
}

ggplot(data=mtcars, aes(x=wt, y=mpg)) +
    geom_point() +
    stat_function(fun=model_function, color="red")    # Plot the function


The quadratic function seems to fit the data a little better than the linear one. Let's investigate further by using the new model to make predictions and check the root mean squared error:
In [14]:
preds <- predict(quadratic_model,    # Model to use for prediction
         newdata=mtcars)             # Data to use for prediction

root_mean_squared_error(preds, mtcars$mpg)
Out[14]:
2.52330047246108
Since the RMSE of the quadratic model is lower than the old one and the adjusted R-squared is higher, it is probably a better model. We do, however, have to be careful about overfitting the data.
Overfitting describes a situation where our model fits the data we use to create it (training data) too closely, resulting in poor generalization to new data. This is why we generally don't want to use training data to evaluate a model: it gives us a biased, usually overly optimistic evaluation. One of the strengths of first and second order linear regression is that they are so simple, they are unlikely to overfit data very much. The more complex the model we create and the more freedom it has to fit the training data, the greater risk we run of overfitting. For example, we could keep including more polynomial terms in our regression model to fit the training data more closely and achieve lower RMSE scores against the training set, but this would almost certainly not generalize well to new data. Let's illustrate this point by fitting a 10th order model to the mtcars data:
In [15]:
tenth_order_model <- lm(mpg ~ wt + I(wt^2) + I(wt^3) + I(wt^4) + I(wt^5) +
                        I(wt^6) + I(wt^7) + I(wt^8) + I(wt^9) + I(wt^10),
                        data=mtcars)

tenth_order_model$coefficients                  # Check model coefficients
Out[15]:
(Intercept)     -14921.1256679509
wt               64581.3723400433
I(wt^2)        -120086.155632163
I(wt^3)         126931.950142918
I(wt^4)         -84659.8581288786
I(wt^5)          37315.5247656017
I(wt^6)         -11033.4768062176
I(wt^7)           2165.90426790932
I(wt^8)           -270.730570029588
I(wt^9)             19.4974178769716
I(wt^10)            -0.615515485385577
In [16]:
model_function2 <- function(x){    # Create a function based on the model coefficients
    -14921.1256679509 + 64581.3723400433*x - 120086.155632163*x^2 +
    126931.950142918*x^3 - 84659.8581288786*x^4 + 37315.5247656017*x^5 -
    11033.4768062176*x^6 + 2165.90426790932*x^7 - 270.730570029588*x^8 +
    19.4974178769716*x^9 - 0.615515485385577*x^10
}
In [17]:
ggplot(data=mtcars, aes(x=wt, y=mpg)) +               
    geom_point() +
    stat_function(fun=model_function2, color="red")


Notice how the 10th order polynomial model curves wildly in some places to fit the training data. While this model happens to yield a closer fit to the training data, it will almost certainly fail to generalize well to new data, as it leads to absurd predictions such as a car having less than 0 mpg if it weighs 5,000 lbs.
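One way to see this generalization problem for yourself is to hold out a few rows of mtcars as a makeshift test set, fit models on the remaining rows only and compare RMSE on the held-out cars. The sketch below is illustrative only: with just 32 cars the exact numbers depend heavily on the random split.

set.seed(12)                                           # Make the random split reproducible
test_rows  <- sample(1:nrow(mtcars), 8)                # Hold out 8 cars as a test set
train_data <- mtcars[-test_rows, ]                     # Cars used to fit the models
test_data  <- mtcars[test_rows, ]                      # Cars held out for evaluation

simple_model  <- lm(mpg ~ wt, data=train_data)         # First order model fit on training data
complex_model <- lm(mpg ~ wt + I(wt^2) + I(wt^3) +     # Higher order model fit on training data
                    I(wt^4) + I(wt^5),
                    data=train_data)

# Compare RMSE on the held-out test cars
root_mean_squared_error(predict(simple_model,  newdata=test_data), test_data$mpg)
root_mean_squared_error(predict(complex_model, newdata=test_data), test_data$mpg)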

Multiple Linear Regression

When faced with a predictive modeling task, you'll often have several variables in your data that may help explain variation in the response variable. You can include more explanatory variables in a linear regression model by adding extra terms to the formula. Let's make a new model that adds the horsepower variable to our original model:
In [18]:
mpg_model <- lm(mpg ~ wt + hp,  # Add horsepower
                data=mtcars)    

summary(mpg_model)
Out[18]:
Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
The regression output shows high significance for the horsepower variable and a better adjusted R-squared score than any of our previous models, suggesting horsepower is a useful predictor of mpg. Let's investigate its relationship with mpg using a plot:
In [19]:
ggplot(data=mtcars, aes(x=hp, y=mpg)) +               
    geom_point()


While mpg does tend to decline with horsepower, the relationship appears more curved than linear, so adding polynomial terms to our multiple regression model could yield a better fit:
In [20]:
mpg_model <- lm(mpg ~ wt + hp + I(wt^2) + I(hp^2), # Include squared terms
                data=mtcars)

summary(mpg_model)                    # View a summary of the regression model

sqrt(mean((mpg_model$residuals)^2))   # Calculate RMSE
Out[20]:
Call:
lm(formula = mpg ~ wt + hp + I(wt^2) + I(hp^2), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8849 -1.8165 -0.3922  1.3499  4.5807 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.945e+01  3.521e+00  14.044 6.27e-14 ***
wt          -9.220e+00  2.270e+00  -4.062 0.000375 ***
hp          -9.428e-02  3.193e-02  -2.952 0.006456 ** 
I(wt^2)      8.500e-01  3.005e-01   2.829 0.008700 ** 
I(hp^2)      1.743e-04  8.073e-05   2.159 0.039879 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.135 on 27 degrees of freedom
Multiple R-squared:  0.8907, Adjusted R-squared:  0.8745 
F-statistic: 55.02 on 4 and 27 DF,  p-value: 1.363e-12
Out[20]:
1.9609108134177
The higher adjusted R-squared and lower RMSE suggest this is a better model than any we made previously. Note that when working with multidimensional models, it becomes difficult to visualize results, so you rely heavily on numeric output.
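One simple visualization that still works with many explanatory variables is a plot of predicted values against actual values; points close to the diagonal line correspond to good predictions (a quick sketch using the current model):

pred_vs_actual <- data.frame(predicted = predict(mpg_model, newdata=mtcars),
                             actual    = mtcars$mpg)

ggplot(data=pred_vs_actual, aes(x=predicted, y=actual)) +
    geom_point() +
    geom_abline(intercept=0, slope=1, color="red")    # Points on this line are perfect predictions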
We could continue adding more explanatory variables in an attempt to improve the model. Sometimes you may wish to include every variable at your disposal in a model. In an R formula, you can use the period character "." as shorthand for "include all the other variables":
In [21]:
mpg_model <- lm(mpg ~ .,        # Make a model with all mtcars variables
                data=mtcars)

summary(mpg_model)              # View a summary of the regression model
Out[21]:
Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
It appears a model involving all variables does not perform as well as our weight and horsepower based model. Adding variables that have little relationship with the response or including variables that are too closely related to one another can hurt your results when using linear regression. You should also be wary of numeric variables that take on few unique values, since they often act more like categorical variables than numeric ones. It may improve your results to convert such variables into factors (the lm() function accepts factor variables).
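For instance, the cyl variable only takes on the values 4, 6 and 8, so we might try treating it as a factor (a rough sketch; whether this actually improves the model depends on the data):

mtcars_factored <- mtcars                             # Work on a copy of the data
mtcars_factored$cyl <- factor(mtcars_factored$cyl)    # Treat cylinder count as categorical

cyl_model <- lm(mpg ~ wt + hp + cyl,                  # cyl now enters the model as a factor
                data=mtcars_factored)

summary(cyl_model)                                    # Check whether the factor terms help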

Wrap Up

Linear regression is one of the most common techniques for making real numbered predictions from data. It is a good place to start any time you need to make a numeric prediction. Next time we'll revisit the Titanic survival data set and focus on classification: assigning observations to categories.
