Numeric data tends to be better behaved than text data. There are only so many symbols that appear in numbers, and they have well-defined values. Despite its relative cleanliness, there are a variety of preprocessing tasks you should consider before using numeric data. In this lesson, we'll learn some common operations used to prepare numeric data for use in analysis and predictive models.
Centering and Scaling Data
Numeric variables are often on different scales and cover different ranges, so they can't be easily compared. What's more, variables with large values can dominate those with smaller values when using certain modeling techniques. Centering and scaling is a common preprocessing task that puts numeric variables on a common scale so no single variable will dominate the others.
The simplest way to center data is to subtract the mean value from each data point, which shifts the data so that its new mean is zero. Let's try zero-centering the mtcars dataset:
In [1]:
cars <- mtcars
print(head(cars))
column_means <- colMeans(cars) # Get the means of each column
print(column_means) # Check means
With the column means in hand, we just need to subtract the column means from each row in an element-wise fashion. To do this we can create a matrix with the same number of rows as the cars data set, where each row contains the column means:
In [2]:
center_matrix <- matrix( rep(column_means, nrow(cars)), # Repeat the column means
nrow=nrow(cars),
ncol=ncol(cars),
byrow = TRUE) # Construct row by row
centered <- cars - center_matrix # Subtract column means
print( head( centered )) # Check the new data set
print(colMeans(centered)) # Check the new column means to confirm they are 0
With zero-centered data, negative values are below average and positive values are above average.
Now that the data is centered, we'd like to put it all on a common scale. One way to put data on a common scale is to divide by the standard deviation. Standard deviation is a statistic that describes the spread of numeric data. The higher the standard deviation, the further the data points tend to be spread away from the mean value. You can get standard deviations with the sd() function:
In [3]:
sd(centered$mpg) # Get the standard deviation of the mpg column
Out[3]:
We need to get the standard deviation of each column. Unfortunately, there's no simple built-in function designed to give us the standard deviation of each column the way colMeans() gave us the column averages. Instead, we can use the apply() function. apply() lets you apply a function you supply to each row or each column of a matrix or data frame:
In [4]:
column_sds <- apply(centered, # A matrix or data frame
MARGIN = 2, # Operate on rows(1) or columns(2)
FUN = sd) # Function to apply
print(column_sds) # Check standard deviations
Now that we have the column standard deviations, we can use the same matrix construction method we used before to scale the data:
In [5]:
scale_matrix <- matrix( rep(column_sds, nrow(cars)), # Repeat the column sds
nrow=nrow(cars),
ncol=ncol(cars),
byrow = TRUE)
centered_scaled <- centered/scale_matrix # Divide by column sds to scale the data
summary(centered_scaled) # Confirm that variables are on similar scales
Out[5]:
Manually centering and scaling as we've done is a good exercise and it gave us an excuse to learn a couple of new functions, but as with many common tasks in R, built-in functions and packages can often make your life easier. It turns out R has a built-in function, scale(), that automatically centers and scales data:
In [6]:
auto_scaled <- scale(cars, # Numeric data object
center=TRUE, # Center the data?
scale=TRUE) # Scale the data?
summary(auto_scaled) # Check the auto scaled data
Out[6]:
Note that the summary output is identical for the automatically scaled data and the data we scaled manually.
Dealing With Skewed Data
The distribution of data--its overall shape and how it is spread out--can have a significant impact on analysis and modeling. Data that is spread symmetrically around the mean in a bell-shaped curve--known as normally distributed data--tends to be well-behaved. On the other hand, some data sets exhibit significant skewness or asymmetry. To illustrate, let's generate a few distributions.
In [7]:
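The code for this cell isn't reproduced in this copy of the lesson. A minimal sketch that generates comparable data (the specific distributions, rnorm() and rexp(), and the sample size are assumptions) would be:

normal_data <- rnorm(10000, mean = 0, sd = 1)   # Symmetric, bell-shaped data
skewed_data <- rexp(10000, rate = 1)            # Data with a long right tail

hist(normal_data, breaks = 50)                  # Plot the normally distributed data
hist(skewed_data, breaks = 50)                  # Plot the right-skewed data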
*Note: normally distributed data tends to look roughly symmetric with a bell-shaped curve.
*Note: data with a long tail that goes off to the right is called positively skewed or right skewed.
When you have a skewed distribution like the one above, the extreme values in the long tail can have a disproportionately large influence on whatever test you perform or models you build. Reducing skew may improve your results. Taking the square root of each data point or taking the natural logarithm of each data point are two simple transformations that can reduce skew. Let's see their effects on the skewed data we generated earlier:
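The transformation cells aren't reproduced in this copy. A minimal sketch, assuming the right-skewed data generated above is stored in skewed_data, would be:

sqrt_data <- sqrt(skewed_data)      # Square root transformation
log_data <- log(skewed_data + 1)    # Natural log transformation (add 1 first; see the note below)

hist(sqrt_data, breaks = 50)        # Check the distribution after the square root transform
hist(log_data, breaks = 50)         # Check the distribution after the log transform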
*Note: adding 1 before taking the log ensures we don't end up with negative values. Also note that neither of these transformations works on data containing negative values. To make them work on data with negative values, add a constant to each value that is large enough to make all the data greater than or equal to 1 (such as adding the absolute value of the smallest number plus 1).
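That shift could be done with a sketch like the following (with_negatives is a hypothetical example vector, not data from this lesson):

with_negatives <- c(-5, -2, 0, 3, 10)                       # Hypothetical data containing negative values
shifted <- with_negatives + abs(min(with_negatives)) + 1    # Shift so the smallest value becomes 1
log_shifted <- log(shifted)                                 # The log transform is now defined everywhere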
Both the sqrt() and log() transforms reduced the skew of the data. It's still not quite normally distributed, but the amount of extreme data in the tails has been reduced to the point where we might not be so worried about it having a large influence on our results.
Highly Correlated Variables
In predictive modeling, each variable you use to construct a model would ideally represent some unique feature of the data. In other words, you want each variable to tell you something different. In reality, variables often exhibit collinearity--a strong correlation or tendency to move together, typically due to some underlying similarity or common influencing factor. Strongly correlated variables can interfere with one another during modeling and muddy your results.
You can check the pairwise correlations between numeric variables using the cor() function:
In [11]:
cor(cars[,1:6]) # Check the pairwise correlations of 6 variables
Out[11]:
A positive correlation implies that when one variable goes up, the other tends to go up as well. A negative correlation indicates an inverse relationship: when one variable goes up, the other tends to go down. A correlation near zero indicates a weak relationship, while a correlation near -1 or 1 indicates a strong negative or positive relationship, respectively.
Inspecting the data table, we see that the number of cylinders a car has (cyl) and its weight (wt) have fairly strong negative correlations with gas mileage (mpg). This indicates that heavier cars and cars with more cylinders tend to get lower gas mileage.
A scatter plot matrix can be a helpful visual aid for inspecting collinearity. We can create one with the pairs() function:
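The plotting cell isn't shown in this copy; the call would likely resemble this sketch:

pairs(cars[,1:6])   # Create pairwise scatter plots of the first 6 variables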
A scatter plot matrix creates pairwise scatter plots that let you visually inspect the relationships between pairs of variables. It can also help identify oddities in the data, such as variables like cyl that only take on values in a small discrete set.
If you find highly correlated variables, there are a few things you can do including:
- Leave them be
- Remove one
- Combine them in some way
Reducing the number of variables under consideration, either by removing some or by combining them in some way, is known as "dimensionality reduction." How you choose to handle correlated variables is ultimately a subjective decision that should be informed by your goal.
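As a hedged illustration only (the variables dropped or combined below are arbitrary example choices, not recommendations), the last two options might look like this:

cars_reduced <- cars[ , names(cars) != "disp"]            # Remove one variable from a correlated pair
car_size <- rowMeans(scale(cars[ , c("disp", "wt")]))     # Combine two correlated variables into one feature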
Imputing Missing Data
In the lesson on initial data exploration, we explored Titanic survivor data and found that several passengers had missing or NA values listed for age. Missing values in numeric data are troublesome because you can't simply treat them as a category: you have to either remove them or fill them in.
Imputation describes filling in missing data with estimates based on the rest of the data set. When working with the Titanic data set, we simply set all the missing Age values to the median age for the data set. This is an example of a simplistic imputation. In practice, we might get better results using an imputation method that doesn't fill in all the missing numbers with the same value.
A common way to impute missing data is to choose values based on similar or "neighboring" records. For instance, if two passengers on the Titanic were almost identical but age was missing for one of them, we could fill in the missing age with the age of the neighbor. K-nearest neighbors or KNN imputation is a method for filling missing values that operates on this principle: it takes a record with a missing value, finds that record's closest neighbors based on the other variables in the data set, and then sets the missing value to a weighted average of the values of its closest neighbors (closer neighbors receive more weight).
To perform KNN imputation we need to install a couple of new packages: the "caret" package and the "RANN" package:
In [13]:
# Install the caret package in R Studio and then load it.
# Caret has some dependencies that you will have to install as well.
# install.pacakges("caret")
# install.packages("RANN")
library(caret)
library(RANN)
The caret package offers a wide range of functions for predictive modeling including preProcess(), an extremely useful function for data preparation. preProcess() accepts a matrix or data frame as input and then performs one or more data preprocessing tasks automatically. The preProcess() function can perform scaling, centering, skewness reduction, imputation and even dimensionality reduction all at the same time. The RANN package implements a fast nearest neighbors search that the caret package uses to perform KNN imputation.
Let's start by removing some random mpg values from the mtcars data set and then use preProcess() to impute them back with KNN imputation:
In [14]:
# The following line sets random mpg values to NA
cars$mpg <- ifelse(runif(nrow(mtcars),0,10) > 7, NA, cars$mpg )
summary(cars$mpg) # Check mpg to confirm NA's have been added
Out[14]:
In [15]:
impute <- preProcess(cars, # Run preprocessing on cars
method=c("knnImpute")) # Use knn imputation
cars <- predict(impute, cars) # Predict new values based on preprocessing
summary(cars) # Check cars
Out[15]:
As we can see from the summary output, the NA values in the mpg variable are gone, but the scale of the variables looks a bit strange. It turns out that preProcess() automatically centers and scales your data when performing KNN imputation, because nearest-neighbor distances aren't meaningful when the variables are on very different scales.
Wrap Up
In the past two lessons, we've learned a variety of methods for preparing text data and numeric data. The majority of data you encounter will likely fall in one of these two categories, but there is one other type of data that appears with enough frequency that you will have to deal with it sooner or later: dates.