The first part of any data analysis or predictive modeling task is an initial exploration of the data. Even if you collected the data yourself and you already have a list of questions in mind that you want to answer, it is important to explore the data before doing any serious analysis, since oddities in the data can cause bugs and muddle your results. Before exploring deeper questions, you have to answer many simpler ones about the form and quality of data. That said, it is important to go into your initial data exploration with a big picture question in mind since the goal of your analysis should inform how you prepare the data.
This lesson aims to raise some of the questions you should consider when you look at a new data set for the first time and show how to perform various R operations related to those questions. We are going to cover a lot of ground in this lesson, touching briefly on many topics from data cleaning to graphing to feature engineering. We will cover many of these topics in future lessons in greater detail.
In this lesson, we will explore the Titanic disaster training set available from Kaggle.com, a website dedicated to data science competitions and education. You need to create a Kaggle account and accept the rules for the Titanic competition to download the data set. The data set contains records for 889 passengers who rode aboard the Titanic.
Exploring The Variables
The first step in exploratory analysis is reading in the data and then exploring the variables. It is important to get a sense of how many variables and cases there are, the data types of the variables and the range of values they take on.
We'll start by changing our working directory and reading in the data:
In [1]:
setwd("C:/Users/Greg/Desktop/Kaggle/titanic")
titanic_train <- read.csv("titanic_train.csv")
When working in RStudio, data you load will appear in the environment pane in the upper right corner. It should show the name of the data frame you used to store the data as well as the number of rows (obs) and columns (variables). Clicking the arrow next to the name of the data frame will show more about the data frame's structure, including the variable names, data types and the first few values for each variable.
Since RStudio automatically shows us the structure of the data, using the str() function is somewhat redundant, but it is always there if you need it:
In [2]:
str(titanic_train)
After determining the data's dimensions and basic data types, it is a good idea to look at a summary of the data:
In [3]:
summary(titanic_train)
Out[3]:
summary() gives a concise overview of each variable, including basic summary statistics for numeric variables. It does not, however, necessarily give us enough information to determine what each variable means. Certain variables like "Age" and "Fare" are self-explanatory, while others like "SibSp" and "Parch" are not. Whoever collects or provides data for download should also provide a list of variable descriptions. In this case, Kaggle provides a list of descriptions on the data download page:
In [4]:
# VARIABLE DESCRIPTIONS:
# survival Survival
# (0 = No; 1 = Yes)
# pclass Passenger Class
# (1 = 1st; 2 = 2nd; 3 = 3rd)
# name Name
# sex Sex
# age Age
# sibsp Number of Siblings/Spouses Aboard
# parch Number of Parents/Children Aboard
# ticket Ticket Number
# fare Passenger Fare
# cabin Cabin
# embarked Port of Embarkation
# (C = Cherbourg; Q = Queenstown; S = Southampton)
After looking at the data for the first time, you should ask yourself a few questions:
- Do I need all of the variables?
- Should I transform any variables?
- Are there NA values, outliers or other strange values?
- Should I create new variables?
For the rest of this lesson we will address each of these questions in the context of this data set.
Do I Need All of The Variables?
Getting rid of unnecessary variables is a good first step when dealing with any data set, since dropping variables reduces complexity and can make computation on the data faster. Whether you should get rid of a variable or not will depend on the size of the data set and the goal of your analysis. With a data set as small as the Titanic data, there's no real need to drop variables from a computing perspective (we have plenty of memory and processing power to deal with such a small data set), but it can still be helpful to drop variables that will only distract from your goal.
This data set is provided in conjunction with a predictive modeling competition where the goal is to use the training data to predict whether passengers of the Titanic listed in a second data set survived or not. We won't be dealing with the second data set (known as the test set) right now, but we will revisit this competition and make predictions in a future lesson on predictive modeling.
Let's go through each variable and consider whether we should keep it or not in the context of predicting survival:
"PassengerId" is just a number assigned to each passenger. It is nothing more than an arbitrary identifier; we could keep it for identification purposes, but lets remove it anyway:
In [5]:
titanic_train$PassengerId <- NULL # Remove PassengerId
"Survived" indicates whether each passenger lived or died. Since predicting survival is our goal, we definitely need to keep it.
Features that describe passengers numerically or group them into a few broad categories could be useful for predicting survival. The variables Pclass, Sex, Age, SibSp, Parch, Fare and Embarked are either numeric or factors with only a handful of categories. Let's keep all of those variables.
We have 3 more features to consider: Name, Ticket and Cabin.
"Name" appears to be a character string of the name of each passenger encoded as a factor. Let's look at name a little closer:
In [6]:
print( head( sort(titanic_train$Name), 15) ) # sort() returns the names in sorted order
Since the Name factor has 889 levels and there are 889 rows in the data set, we know each name is unique. It appears that married women have their maiden names listed in parentheses. In general, a variable that is unique to each case isn't useful for prediction. We could extract last names to try to group family members together, but even then the number of categories would be very large. In addition, the Parch and SibSp variables already contain some information about family relationships, so from the perspective of predictive modeling, the Name variable could be removed. On the other hand, it can be nice to have some way to uniquely identify particular cases and names are interesting from a personal and historical perspective, so let's keep Name, knowing that we won't actually use it in any predictive models we make.
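If we did want to group passengers by family, one rough way would be to strip everything after the comma in each name. A quick sketch (assuming names follow the "Surname, Title First" pattern seen above, and not something we'll use going forward):
last_names <- gsub(",.*$", "", as.character(titanic_train$Name))  # Keep the text before the first comma
print( head(last_names) )                                         # Inspect a few last names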
Next, let's look closer at "Ticket":
In [7]:
print( head( titanic_train$Ticket,25 ) )
Ticket has 680 levels: almost as many levels as there are passengers. Factors with almost as many levels as there are records are generally not very useful for prediction. We could try to reduce the number of levels by grouping certain tickets together, but the ticket numbers don't appear to follow any logical pattern we could use for grouping. Let's remove it:
In [8]:
titanic_train$Ticket <- NULL
Finally, let's consider the "Cabin" variable:
In [9]:
print( head( titanic_train$Cabin,25 ) )
Cabin also has quite a few unique values, with 146 levels, which indicates it may not be particularly useful for prediction. On the other hand, the names of the levels for the Cabin variable seem to have a fairly regular structure: each starts with a capital letter followed by a number. We could use that structure to reduce the number of levels into categories large enough that they might be useful for prediction. Let's keep Cabin for now.
As you might have noticed, removing variables is often more of an art than a science. It is easiest to start simple: don't be afraid to remove (or simply ignore) confusing, messy or otherwise troublesome variables temporarily when you're just getting started with an analysis or predictive modeling task. Data projects are iterative processes: you can start with a simple analysis or model using only a few variables and then expand later by adding more and more of the other variables you initially ignored or removed.
Should I Transform Any Variables?
When you first load a data set, some of the variables may be encoded as data types that don't fit well with what the data really is or what it means. For instance, when we loaded the Titanic data, we did not include the stringsAsFactors = FALSE argument, so all the character variables were turned into factors. After inspecting the data, Sex, Cabin, Embarked and Ticket all appear to be categorical variables that were appropriately turned into factors. Names, however, are unique identifiers, not categories, so it doesn't really make sense to encode Name as a factor.
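Incidentally, had we wanted to keep all of the character columns as characters from the start, we could have supplied the stringsAsFactors argument when reading the file. A minimal sketch (the name titanic_train_chr is just for illustration):
titanic_train_chr <- read.csv("titanic_train.csv",
                              stringsAsFactors = FALSE)  # Read without converting strings to factors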
Let's turn Name back into a character:
In [10]:
titanic_train$Name <- as.character(titanic_train$Name)
Now let's inspect the Survived variable using the table function:
In [11]:
table( titanic_train$Survived ) # Create a table of counts
Out[11]:
Survived is just an integer variable that takes on the value 0 or 1 depending on whether a passenger died or survived, respectively. Variables that indicate a state or the presence or absence of something with the numbers 0 and 1 are sometimes called indicator variables or dummy variables (0 indicates absence and 1 indicates presence). Indicator variables are essentially just a shorthand for encoding a categorical variable with 2 levels. We could instead encode Survived as a factor and give each level names that are more informative than 0 and 1:
In [12]:
new_survived <- factor(titanic_train$Survived)
levels(new_survived) <- c("Died","Survived")
table(new_survived)
Out[12]:
Survived looks a little nicer as a factor with appropriate level names, but even so, we're not going to change it. Why not? If you remember, our goal with this data set is predicting survival for the Kaggle competition. It turns out that when submitting predictions for the competition, the predictions need to be encoded as 0 or 1. It would only complicate things to transform Survived, only to convert it back to 0 and 1 later. This shows the importance of having a good understanding of the problem you are working on.
There's one more variable that has a questionable data encoding: Pclass. Pclass is an integer that indicates a passenger's class, with 1 being first class, 2 being second class and 3 being third class. Passenger class is a category, so it doesn't make a lot of sense to encode it as a numeric variable. What's more, 1st class would be considered "above" or "higher" than second class, but when encoded as an integer, 1 comes before 2, which is the opposite of the ordering we'd like the variable to reflect. We can fix this by transforming Pclass into an ordered factor:
In [13]:
titanic_train$Pclass <- ordered(titanic_train$Pclass, levels=c("3","2","1"))
table(titanic_train$Pclass)
Out[13]:
Now it's time to revisit the Cabin variable. We didn't delete it and it is the proper data type, but it has more levels than we'd like. It appears that each Cabin is in a general section of the ship indicated by the capital letter at the start of each factor level:
In [14]:
levels(titanic_train$Cabin)
Out[14]:
If we grouped Cabin just by this letter, we could reduce the number of levels while potentially extracting some useful information. Also note the first Cabin level, "". Two quotes with nothing between them denote the empty string, which generally indicates a missing character value.
Now let's transform Cabin, grouping each cabin by the capital letter at the start of its name and keeping the empty string as an extra category:
In [15]:
char_cabin <- as.character(titanic_train$Cabin) # Convert to character
new_Cabin <- ifelse(char_cabin == "", # If the value is ""
"", # Keep it
substr(char_cabin,1,1)) # Else transform it to a substring *
new_Cabin <- factor(new_Cabin ) # Convert back to a factor
table( new_Cabin ) # Inspect the result as a table
Out[15]:
*Note: the substr() function takes a character vector as input and produces substrings as output. Here we are creating substrings from char_cabin, where each substring starts at index 1 and ends at index 1, effectively keeping only the first character of each cabin name.
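A quick illustration with a made-up cabin string:
substr("C123", 1, 1)   # Returns "C", the first character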
The table shows we succeeded in condensing Cabin into a handful of broader categories, but we also discovered something interesting: 688 of the records have Cabin equal to the empty string. In other words, more than 2/3 of the passengers don't even have a cabin listed at all! Discovering and deciding how to handle these sorts of peculiarities is an important part of working with data, and there often isn't a single correct answer.
Since there are so many missing values, the Cabin variable might be devoid of useful information for prediction. On the other hand, a missing cabin variable could be an indication that a passenger died: after all, how would we know what cabin a passenger stayed in if they weren't around to tell the tale?
Let's keep the new cabin variable:
In [16]:
titanic_train$Cabin <- new_Cabin
This is as far as we'll go with transformations right now, but know that the transformations we've covered here are just the tip of the iceberg.
Are there NA Values, Outliers or Other Strange Values?
Data sets are often littered with missing (NA) data, extreme data points called outliers and other strange values. Missing values, outliers and strange values can negatively affect statistical tests and models and may even cause certain functions to fail.
In R, you can detect NA values with the is.na() function:
In [17]:
dummy_vector <- c(1,1,2,3,NA,4,3,NA)
is.na(dummy_vector) # Check whether values are NA or not
Out[17]:
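We could use the logical vector returned by is.na() to drop the NA values, or use na.omit(); a quick sketch:
dummy_vector[ !is.na(dummy_vector) ]   # Keep only the non-NA values
na.omit(dummy_vector)                  # na.omit() also drops NA values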
Detecting NA values is the easy part: it is far more difficult to decide how to handle them. In cases where you have a lot of data and only a few NA values, it might make sense to simply delete records with NA values present. On the other hand, if you have more than a handful of NA values, removing records with NA values could cause you to get rid of a lot of data. NA values in factors are not particularly troubling because you can simply treat NA as an additional category. NA values in numeric variables are more troublesome: you can't just treat NA as a number. As it happens, the Titanic dataset has some NA's in the Age variable:
In [18]:
summary( titanic_train$Age )
Out[18]:
With 177 NA values it's probably not a good idea to throw all those records away. Here are a few ways we could deal with them:
- Replace the NAs with 0s
- Replace the NAs with some central value like the mean or median
- Impute values for the NAs (estimate values using statistical/predictive modeling methods)
- Split the data set into two parts: one set where records have an Age value and one set where Age is NA
Setting NA values in numeric data to zero makes sense in some cases, but it doesn't make any sense here because a person's age can't be zero. Setting all ages to some central number like the median is a simple fix but there's no telling whether such a central number is a reasonable estimate of age without looking at the distribution of ages. For all we know each age is equally common. We can quickly get a sense of the distribution of ages by creating a histogram with the hist() function:
In [19]:
hist( titanic_train$Age, breaks=20) # Create a histogram of age with 20 bins
From the histogram, we see that ages between 20 and 30 are the most common, so filling in NA values with a central number like the mean or median wouldn't be entirely unreasonable. Let's fill in the NA values with the median value of 28:
In [20]:
na_logical <- is.na( titanic_train$Age ) # Create a logical variable to flag NA values
new_age_variable <- ifelse(na_logical, # If NA was found
28, # Change the value to 28
titanic_train$Age) # Else keep the old value
titanic_train$Age <- new_age_variable # Change the age variable
summary(titanic_train$Age) # Check the new variable
Out[20]:
Since we just added a bunch of 28s to Age, let's look at the histogram again for a sanity check. The bar representing 28 should be much taller this time.
In [21]:
hist( titanic_train$Age, breaks=20)
Some of the ages we assigned are probably way off, but it might be better than throwing entire records away. In practice, imputing the missing data (estimating age based on other variables) might have been a better option, but we'll stick with this for now.
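As a rough sketch of what imputation could look like, we might fit a simple linear model on the rows that originally had an Age (using the na_logical flag we created above) and predict ages for the rest. This is only an illustration with an arbitrary choice of predictors, not something we'll apply here:
age_model <- lm(Age ~ Pclass + SibSp + Parch + Fare,
                data = titanic_train[!na_logical, ])            # Fit on rows that had an Age
imputed_ages <- predict(age_model,
                        newdata = titanic_train[na_logical, ])  # Predicted ages for the NA rows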
Next let's consider outliers. Outliers are extreme numerical values: values that lie far away from the typical values a variable takes on. Creating plots is one of the quickest ways to detect outliers. For instance, the histogram above shows that 1 or 2 passengers were near age 80. Ages near 80 are uncommon for this data set, but looking at the general shape of the data, one or two 80-year-olds don't seem particularly surprising.
Now let's investigate the "Fare" variable. This time we'll use a boxplot, since boxplots are designed to show the spread of the data and help identify outliers:
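In [22]:
boxplot(titanic_train$Fare)   # Create a boxplot of Fare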
In a boxplot, the central box represents 50% of the data and the central bar represents the median. The dotted lines with bars on the ends are "whiskers", which encompass the great majority of the data, and circles beyond the whiskers indicate uncommon values. In this case, we have some uncommon values that are so far away from the typical value that the box appears squashed in the plot: this is a clear indication of outliers. Indeed, it looks like one passenger paid almost twice as much as any other passenger. Even the passengers who paid between 200 and 300 paid far more than the vast majority of the other passengers.
For interest's sake, let's check the name of this high roller. The function which() takes a logical vector and returns all the indices for which the logical vector is true. In this case, we want to find the index of the person who paid the maximum Fare. It turns out there is a related function, which.max(), that returns the index of the maximum value of a vector:
In [23]:
high_roller_index <- which.max( titanic_train$Fare ) # Get the index of the max Fare
high_roller_index # Check the index
titanic_train[high_roller_index,] # Use the index to check the record
Out[23]:
Similar to NA values, there's no single cure for outliers. You can keep them, delete them or transform them in some way to try to reduce their impact. Even if you decide to keep outliers unchanged it is still worth identifying them since they can have disproportionately large influence on your results. Let's keep Miss Ward unchanged.
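For instance, a log transformation is one common way to reduce the influence of extreme values. A quick sketch (using log1p(), which handles fares of zero, and not a change we'll actually make):
log_fare <- log1p(titanic_train$Fare)   # log(1 + Fare) compresses extreme values
summary(log_fare)                       # The spread is far less extreme on the log scale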
Data sets can have other strange values beyond NA values and outliers that you may need to address. For example, the large number of empty strings in the Cabin variable is an oddity that could undermine its usefulness in prediction. Sometimes data is mislabeled or simply erroneous; bad data can corrupt any sort of analysis, so it is important to address these sorts of issues before doing too much work.
Should I Create New Variables?
The variables present when you load a data set aren't always the most useful variables for analysis. Creating new variables that are derivations or combinations of existing ones is a common step to take before jumping into an analysis or modeling task.
For example, imagine you are analyzing web site auctions where one of the data fields is a text description of the item being sold. A raw block of text is difficult to use in any sort of analysis, but you could create new variables from it such as a variable storing the length of the description or variables indicating the presence of certain keywords.
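A small hypothetical sketch (item_descriptions is a made-up character vector standing in for that text field):
item_descriptions <- c("Vintage oak dining table, lightly used",
                       "BRAND NEW sealed wireless headphones")
description_length <- nchar(item_descriptions)                       # Length of each description
mentions_new <- grepl("new", item_descriptions, ignore.case = TRUE)  # Indicator for the keyword "new"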
Creating a new variable can be as simple as taking one variable and adding, multiplying or dividing by another. Let's create a new variable Family that combines SibSp and Parch to indicate the total number of family members (siblings, spouses, parents and children) a passenger has on board:
In [24]:
titanic_train$Family <- titanic_train$SibSp + titanic_train$Parch
For interest's sake, let's find out who had the most family members on board:
In [25]:
most_family <- which( titanic_train$Family == max(titanic_train$Family))
titanic_train[most_family,]
Out[25]:
*Note: which.max() only returns a single index (the first max) even if multiple records contain the max value. When we used which.max() earlier to find the high roller, we made the implicit assumption that only one person paid the high fare (which turns out to have been an incorrect assumption!). To find the indices of all the records equal to the max, use which().
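For instance, to see every passenger who paid the top fare, we could have run:
which( titanic_train$Fare == max(titanic_train$Fare) )   # Indices of all max-fare passengers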
There were 7 people on board with 8 siblings/spouses and 2 parents/children--they were probably all siblings of one another (they also probably had missing Age data, since we see all the ages are set to 28). Tragically, all 7 of them passed away. The 8th sibling is likely in the test data for which we are supposed to make predictions. Would you predict that the final sibling survived or died?
Wrap Up
In this lesson, we covered several general questions you should address when you first inspect a data set. Your first goal should be to explore the structure of the data to clean it and prepare the variables for your analysis. Once your data is in the right format, you can move from exploring structure to exploring meaning.