In the last few lessons we learned about statistical inference techniques, including the t-test, the chi-squared test and ANOVA, which let you analyze differences between data samples. Predictive modeling--using data samples to make predictions about unseen data, such as data that has yet to be generated--is another common data analytics task. Predictive modeling is a form of machine learning, which describes using computers to automate the process of finding patterns in data.
Machine learning is the driving force behind all kinds of modern conveniences and automation systems like ATMs that can read handwritten text, smartphones that translate speech to text and self-driving cars. The methods used in such cutting-edge applications are more advanced than anything we'll cover in this introduction, but they are all based on the principles of taking data and applying some learning algorithm to it to arrive at some sort of prediction.
This lesson is intended to provide a high level overview of linear regression and how to begin using it in R.
Linear Regression Basics
Linear regression is a predictive modeling technique for predicting a numeric response variable based on one or more explanatory variables. The term "regression" in predictive modeling generally refers to any modeling task that involves predicting a real number (as opposed to classification, which involves predicting a category or class). The term "linear" in the name linear regression refers to the fact that the method models data with a linear combination of the explanatory variables. A linear combination is an expression where one or more variables are scaled by a constant factor and added together. In the case of a single explanatory variable, the linear combination used in linear regression can be expressed as:

y = b0 + b1*x
The right side of the equation defines a line with a certain y-intercept (b0) and a slope (b1) times the explanatory variable. In other words, linear regression in its most basic form fits a straight line to the response variable. The model is designed to fit a line that minimizes the squared differences between its predictions and the observed values (differences also known as errors or residuals). We won't go into all the math behind how the model actually minimizes the squared errors, but the end result is a line intended to give the "best fit" to the data. Since linear regression fits data with a line, it is most effective in cases where the response and explanatory variable have a linear relationship.
Let's revisit the mtcars data set and use linear regression to predict vehicle gas mileage based on vehicle weight. First, let's look at a scatterplot of weight and mpg to get a sense of the shape of the data:
In [1]:
library(ggplot2)
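A scatterplot of this kind can be drawn with ggplot2 along the following lines (a minimal sketch; the original plotting call may have differed):

ggplot(data=mtcars, aes(x=wt, y=mpg)) +   # Plot weight against mpg
    geom_point()                          # Draw one point per car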
The scatterplot shows a roughly linear relationship between weight and mpg, suggesting a linear regression model might work well.
To fit a linear model in R, pass a formula specifying the model to the lm() function:
In [3]:
mpg_model <- lm(mpg ~ wt, # Formula of the form response ~ explanatory
data=mtcars) # Data set to use
summary(mpg_model) # View a summary of the regression model
Out[3]:
The output above shows the formula used to make the model, followed by a five number summary of the residuals and a summary of the model coefficients. The coefficients are the constants used to create the best fit line: in this case the y-intercept term is set to 37.2851 and the coefficient for the weight variable is -5.3445. In other words, the model fit the line mpg = 37.2851 - 5.3445*wt.
The t-value and Pr(>|t|) (p-value) columns indicate the results of conducting a t-test checking whether the coefficients chosen differ significantly from zero. In this case, there is near certainty that both the y-intercept and weight variable coefficients are not zero given the extremely low p-values, meaning both the intercept term and the weight are likely to be useful for predicting mpg.
At the bottom of the summary output, the values of interest to us are "Multiple R-squared" and "Adjusted R-squared." Multiple R-squared is a value that describes the proportion of the variance in the response variable explained by the model. In this case, it basically tells how much of the variation in mpg can be explained by weight. Adjusted R-squared is a modification of the normal R-squared that adjusts for models that involve multiple explanatory variables. As you add more explanatory variables to a model, the normal R-squared reading can only increase, even if those variables add little information to the model. In general, we'd like to keep models as simple as possible, so adjusted R-squared is preferable for assessing the explanatory power of the model. In this case, roughly 75% of the variation in mpg is explained by weight.
Let's plot the regression line on the scatterplot to get a sense of how well it fits the data:
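One way to overlay the fitted line is with geom_smooth(), mirroring the outlier example shown below (a sketch; the original plotting call may have differed):

ggplot(data=mtcars, aes(x=wt, y=mpg)) +
    geom_point() +
    geom_smooth(method=lm,     # Add a linear regression line
                se=FALSE,      # Don't add shaded confidence region
                color="red")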
The regression line looks like a reasonable fit and it follows our intuition: as car weight increases we would expect fuel economy to decline.
Outliers can have a large influence on linear regression models: since regression deals with minimizing squared residuals, large residuals have a disproportionately large influence on the model. Plotting the results helps us detect influential outliers. In this case there do not appear to be any influential outliers. Let's add an outlier--a super heavy fuel efficient car--and plot a new regression model:
In [5]:
super_car <- c(50,10) # Outlier car: 50 mpg with a weight of 10 (10,000 lbs)
new_cars <- rbind(mtcars[,c("mpg","wt")], super_car) # Add outlier to mtcars
ggplot(data=new_cars, aes(x=wt, y=mpg)) +
geom_point() +
geom_smooth(method=lm, # Add a linear regression line
se=FALSE, # Don't add shaded confidence region
color="red")
Although this is an extreme, contrived case, the plot above illustrates how much influence a single outlier can have on a linear regression model.
Making Predictions
After creating a model, we'd like to be able to use it to make predictions. R contains a built-in function called predict() you can use to generate predictions based on a specified model for the original data set or for new data with the same variables as the original data. Let's use predict() to make predictions on the original data:
In [6]:
preds <- predict(mpg_model, # Model to use for prediction
newdata=mtcars) # Data to use for prediction
preds # Check predictions
Out[6]:
Note: In practice, we'd typically want to make predictions based on some new unseen data (test data) that did not play a part in creating the original model. We will see examples of making predictions on test data when we revisit the Titanic survival prediction competition in the next lesson.
We can check the differences between our predictions and the actual values by subtracting the predictions from the true values.
In [7]:
residuals <- mtcars$mpg-preds # Compute residuals (actual values minus predictions)
residuals
Out[7]:
In a well-behaved linear regression model, we'd like the residuals to be roughly normally distributed. That is, we'd like a roughly even spread of error above and below the regression line. We can investigate the normality of residuals with a Q-Q (quantile-quantile) plot:
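Base R's qqnorm() and qqline() functions can produce the plot (a minimal sketch; the original plotting cell may have differed):

qqnorm(residuals)    # Plot residual quantiles against normal quantiles
qqline(residuals)    # Add a reference line indicating perfect normality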
When residuals are normally distributed, they tend to lie along the straight line on the Q-Q plot. In this case the residuals appear to follow a slightly non-linear pattern: they are bowed a bit away from the normality line on each end. This is an indication that a simple straight line might not be sufficient to fully describe the relationship between weight and mpg.
After making model predictions, it is useful to have some sort of metric to evaluate how well the model performed. Adjusted R-squared is one useful measure, but it only applies to the regression model itself: we'd like some universal evaluation metric that lets us compare the performance of different types of models. Root mean squared error (RMSE) is a common evaluation metric for predictions involving real numbers. Root mean squared error is the square root of the average of the squared errors (the squared residuals). If you recall, we wrote a function to calculate RMSE in lesson 11:
In [9]:
root_mean_squared_error <- function(predicted, targets){
# Computes root mean squared error between two vectors
#
# Args:
# predicted: a numeric vector of predictions
# targets: a numeric vector of target values for each prediction
#
# Returns:
# The root mean squared error between predicted values and targets
sqrt(mean((targets-predicted)^2))
}
root_mean_squared_error(preds, mtcars$mpg) # compute RMSE
Out[9]:
We could also use a pre-made RMSE function, such as the one provided in the caret package:
In [10]:
library(caret)
In [11]:
RMSE(preds, mtcars$mpg)
Out[11]:
Polynomial Regression
Variables often exhibit non-linear relationships that can't be fit well with a straight line. In these cases, we can use linear regression to fit a curved line to the data by adding extra higher order terms (squared, cubic, etc.) to the model formula. A linear regression that involves higher order terms is known as "polynomial regression." You can include higher order terms by adding extra explanatory terms wrapped in the I() function:
In [12]:
quadratic_model <- lm(mpg ~ wt + I(wt^2), # Add the weight squared term to the model
data=mtcars)
summary(quadratic_model) # Check the model result
Out[12]:
The summary output shows us that including the weight squared term appears to improve the model's performance: the adjusted R-squared increased to 0.8066 and the 2nd order polynomial term has a p-value of 0.00286, indicating that it is highly significant. Let's plot the curved line defined by the new model to see if the fit looks better than the old one:
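One way to draw the curve is to pass the same quadratic formula to geom_smooth() (a sketch; the original plotting cell may have used a different approach):

ggplot(data=mtcars, aes(x=wt, y=mpg)) +
    geom_point() +
    geom_smooth(method=lm,
                formula=y ~ x + I(x^2),   # Fit and draw a quadratic curve
                se=FALSE,
                color="red")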
The quadratic function seems to fit the data a little better than the linear one. Let's investigate further by using the new model to make predictions and check the root mean squared error:
In [14]:
preds <- predict(quadratic_model, # Model to use for prediction
newdata=mtcars) # Data to use for prediction
root_mean_squared_error(preds, mtcars$mpg)
Out[14]:
Since the RMSE of the quadratic model is lower than the old one and the adjusted R-squared is higher, it is probably a better model. We do, however, have to be careful about overfitting the data.
Overfitting describes a situation where our model fits the data we use to create it (training data) too closely, resulting in poor generalization to new data. This is why we generally don't want to use training data to evaluate a model: it gives us a biased, usually overly optimistic evaluation. One of the strengths of first and second order linear regression is that the models are so simple that they are unlikely to overfit the data very much. The more complex the model we create and the more freedom it has to fit the training data, the greater risk we run of overfitting. For example, we could keep including more polynomial terms in our regression model to fit the training data more closely and achieve lower RMSE scores against the training set, but this would almost certainly not generalize well to new data. Let's illustrate this point by fitting a 10th order model to the mtcars data:
In [15]:
tenth_order_model <- lm(mpg ~ wt + I(wt^2) + I(wt^3) + I(wt^4) + I(wt^5) +
I(wt^6) + I(wt^7) + I(wt^8) + I(wt^9) + I(wt^10),
data=mtcars)
tenth_order_model$coefficients # Check model coefficients
Out[15]:
In [16]:
model_function2 <- function(x){ # Create a function based on the model coefficients
    -14921.1256679509 + 64581.3723400433*x - 120086.155632163*x^2 +
    126931.950142918*x^3 - 84659.8581288786*x^4 + 37315.5247656017*x^5 -
    11033.4768062176*x^6 + 2165.90426790932*x^7 - 270.730570029588*x^8 +
    19.4974178769716*x^9 - 0.61551548538557*x^10
}
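The curve defined by this function can be overlaid on the scatterplot with stat_function() (a sketch; the original plotting cell may have differed):

ggplot(data=mtcars, aes(x=wt, y=mpg)) +
    geom_point() +
    stat_function(fun=model_function2,   # Draw the 10th order polynomial curve
                  color="red")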
Notice how the 10th order polynomial model curves wildly in some places to fit the training data. While this model happens to yield a closer fit to the training data, it will almost certainly fail to generalize well to new data, as it leads to absurd predictions such as a car getting less than 0 mpg if it weighs 5,000 lbs.
Multiple Linear Regression
When faced with a predictive modeling task, you'll often have several variables in your data that may help explain variation in the response variable. You can include more explanatory variables in a linear regression model by adding extra terms to the formula. Let's make a new model that adds the horsepower variable to our original model:
In [18]:
mpg_model <- lm(mpg ~ wt + hp, # Add horsepower
data=mtcars)
summary(mpg_model)
Out[18]:
The regression output shows high significance for the horsepower variable and a better adjusted R-squared score than any of our previous models, which suggests horsepower helps explain variation in mpg. Let's investigate the shape of the relationship between horsepower and mpg with a plot:
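A quick scatterplot of horsepower against mpg shows the shape of the relationship (a minimal sketch; the original plotting call may have differed):

ggplot(data=mtcars, aes(x=hp, y=mpg)) +
    geom_point()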
While mpg does tend to decline with horsepower, the relationship appears more curved than linear so adding polynomial terms to our multiple regression model could yield a better fit:
In [20]:
mpg_model <- lm(mpg ~ wt + hp + I(wt^2) + I(hp^2), # Include squared terms
data=mtcars)
summary(mpg_model) # View a summary of the regression model
sqrt(mean((mpg_model$residuals)^2)) # Calculate RMSE
Out[20]:
The higher adjusted R-squared and lower RMSE suggest this is a better model than any we made previously. Note that when working with multidimensional models it becomes difficult to visualize the results, so you have to rely more heavily on the numeric output.
We could continue adding more explanatory variables in an attempt to improve the model. Sometimes you may wish to include every variable at your disposal into a model. In an R formula, you can use the period character "." as shorthand for "include all variables":
In [21]:
mpg_model <- lm(mpg ~ ., # Make a model with all mtcars variables
data=mtcars)
summary(mpg_model) # View a summary of the regression model
Out[21]:
It appears a model involving all the variables does not perform as well as our weight and horsepower based model. Adding variables that have little relationship with the response, or including variables that are too closely related to one another, can hurt your results when using linear regression. You should also be wary of numeric variables that take on few unique values, since they often act more like categorical variables than numeric ones. It may improve your results to convert such variables into factors (the lm() function accepts factor variables), as sketched below.
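For instance, cyl and gear take on only a few unique values, so one option worth experimenting with is treating them as factors (a hypothetical tweak for illustration, not something evaluated in this lesson):

mtcars_factored <- mtcars                              # Copy the data set
mtcars_factored$cyl <- factor(mtcars_factored$cyl)     # Treat cylinder count as categorical
mtcars_factored$gear <- factor(mtcars_factored$gear)   # Treat gear count as categorical

factor_model <- lm(mpg ~ ., data=mtcars_factored)      # Refit the model with all variables
summary(factor_model)                                  # Check whether the fit improves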
Wrap Up
Linear regression is one of the most common techniques for making real numbered predictions from data. It is a good place to start any time you need to make a numeric prediction. Next time we'll revisit the Titanic survival data set and focus on classification: assigning observations to categories.