Kaggle recently released a knowledge competition entitled "House Prices: Advanced Regression Techniques" aimed at giving users an opportunity to explore and make predictions on real-world housing data. Predicting home sale prices based on home features is a classic scenario used to teach regression, but you don't usually get to work with real data. This competition presents a chance to show an example of a basic end-to-end data analysis as a practical complement to my 30-part Introduction to R.
The first thing you should do when working with any new data set is determine its size, format and any other relevant information you can find without actually looking at the data itself. Huge data sets can take a long time to load or may not fit in memory at all. Thankfully, the data for this competition is very small--the training set is only 450KB as a .csv file--so it should be easy to work with. In addition, Kaggle provides a text file along with the data set containing detailed descriptions of all 79 explanatory variables, which should make the data much easier to explore.
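For instance, we can check the file sizes from within R before reading anything into memory. A minimal check, assuming the competition files have already been downloaded to the working directory:
file.info("train.csv")$size / 1024  # File size in KB
file.info("test.csv")$size / 1024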
Let's start by reading the data and getting a sense of its shape.
Exploratory Data Analysis
In [1]:
train = read.csv("train.csv")
test = read.csv("test.csv")
- Note: I'm using "=" instead of the standard arrow for assignment because the arrow syntax causes Blogger's HTML to break.
In [2]:
dim(train)
dim(test)
In [3]:
str(train)
As we can see, the training data contains 1460 records with 81 variables, including an ID and the sale price. The test data has one fewer variable because it does not contain the prediction target, SalePrice:
In [4]:
str(test)
Inspecting the output above reveals that our data is not entirely clean. First, certain variables contain NA values which could cause problems when we make predictive models later on. Second, the levels of some of the factor variables are not the same across the training set and test set. For instance:
In [5]:
levels(train$MiscFeature)
levels(test$MiscFeature)
Differing factor levels could cause problems with predictive modeling later on, so we need to resolve these issues before going further. We can make sure the train and test sets have the same factor levels by loading each data set again without converting strings to factors, combining them into one large data set, converting strings to factors for the combined data set, and then separating them again. While we're at it, let's change any NA values we find in the character data to a new level called "missing":
In [6]:
train = read.csv("train.csv", stringsAsFactors = FALSE)
test = read.csv("test.csv", stringsAsFactors = FALSE)
# Remove the target variable not found in test set
SalePrice = train$SalePrice
train$SalePrice = NULL
# Combine data sets
full_data = rbind(train,test)
# Convert character columns to factor, filling NA values with "missing"
for (col in colnames(full_data)){
    if (typeof(full_data[,col]) == "character"){
        new_col = full_data[,col]
        new_col[is.na(new_col)] = "missing"
        full_data[col] = as.factor(new_col)
    }
}
# Separate out our train and test sets
train = full_data[1:nrow(train),]
train$SalePrice = SalePrice
test = full_data[(nrow(train)+1):nrow(full_data),]
Now the factor levels should be identical across both data sets. Let's continue our exploration by looking at a summary of the training data.
In [7]:
summary(train)
The summary output gives us a basic sense of each variable's distribution, but it also reveals another issue: some of the numeric columns contain NA values. None of the numeric variables contain negative values, so encoding the NAs as a negative number is a simple way to remove them while making it easy to tell which values were originally missing. We will be using a tree-based model in this example, so the scale of our numbers shouldn't affect the model, and setting the NAs to -1 essentially lets -1 act as a numeric flag for missing values. If we were using a model that scales numeric variables by a learned parameter, like linear regression, we might prefer a different solution such as imputing missing values, and we'd also want to consider centering, scaling and normalizing the numeric features so that they are on the same scale and have roughly normal distributions.
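For reference only (we stick with the -1 flag below), a rough sketch of that alternative using caret's preProcess function might look like the following, assuming the caret package is loaded. The "medianImpute" method fills numeric NAs with column medians, while "center" and "scale" standardize each column:
numeric_cols = sapply(train, is.numeric)  # Identify numeric columns
prep = preProcess(train[, numeric_cols], method = c("medianImpute", "center", "scale"))
train_numeric_scaled = predict(prep, train[, numeric_cols])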
In [8]:
# Fill remaining NA values with -1
train[is.na(train)] = -1
test[is.na(test)] = -1
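As a quick sanity check, counting the remaining NA values should now return zero for both data sets:
sum(is.na(train))
sum(is.na(test))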
Now the data should be clean with no NA values, so we can start exploring how home features affect sale prices. For one, it would be useful to know whether any of the variables are highly correlated with SalePrice. Let's find the numeric variables whose correlation with SalePrice has an absolute value above 0.5:
In [9]:
for (col in colnames(train)){
    if(is.numeric(train[,col])){
        if( abs(cor(train[,col], train$SalePrice)) > 0.5){
            print(col)
            print( cor(train[,col], train$SalePrice) )
        }
    }
}
The output shows that a handful of variables have relatively strong correlations with sale price, with "OverallQual" being the highest at 0.7909816. These variables are likely important for predicting sale prices. Now let's see which numeric variables have low correlations with sale price:
In [10]:
for (col in colnames(train)){
    if(is.numeric(train[,col])){
        if( abs(cor(train[,col], train$SalePrice)) < 0.1){
            print(col)
            print( cor(train[,col], train$SalePrice) )
        }
    }
}
The year and month sold don't appear to have much of a connection to sale prices. Interestingly, overall condition doesn't have a strong correlation with sale price, while overall quality had the strongest correlation of any variable.
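A couple of quick base R scatterplots (optional, just to eyeball that contrast) make the difference visible:
plot(train$OverallQual, train$SalePrice, main = "OverallQual vs. SalePrice")
plot(train$OverallCond, train$SalePrice, main = "OverallCond vs. SalePrice")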
Next, let's determine whether any of the numeric variables are highly correlated with one another.
In [11]:
# Correlation matrix of the numeric variables (including SalePrice)
cors = cor(train[ , sapply(train, is.numeric)])
# Positions with absolute correlation above 0.6, excluding the diagonal
high_cor = which(abs(cors) > 0.6 & abs(cors) < 1, arr.ind = TRUE)
# Look up the variable names and correlation value for each position
cor_data = data.frame(cols = colnames(cors)[high_cor[, "col"]],
                      rows = rownames(cors)[high_cor[, "row"]],
                      correlation = cors[high_cor])
cor_data
- Note that since the table above was constructed from a symmetric correlation matrix, each pair appears twice.
The table shows that 11 variables have correlations above 0.6, leaving out the target variable SalePrice. The highest correlation is between GarageCars and GarageArea, which makes sense because we'd expect a garage that can park more cars to have more area. Highly correlated variables can cause problems with certain types of predictive models, but since no variable pairs have a correlation above 0.9 and we will be using a tree-based model, let's keep them all.
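If you want to eyeball the strongest pair, a simple scatterplot of garage capacity against garage area shows how closely the two move together:
plot(train$GarageCars, train$GarageArea, main = "GarageCars vs. GarageArea")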
Now let's explore the distributions of the numeric variables with density plots. This can help us identify outliers and see whether each variable, including our target variable, is roughly normal, skewed or exhibits other oddities.
In [12]:
for (col in colnames(train)){
    if(is.numeric(train[,col])){
        plot(density(train[,col]), main=col)
    }
}
There are too many variables to discuss all the plots in detail, but a quick glance reveals that many of the numeric variables show right skew. Also, many variables have significant density near zero, indicating that certain features are only present in a subset of homes. It also appears that far more homes sell in the spring and summer months than in winter. Lastly, the target variable SalePrice appears roughly normal, but it has a tail that extends to the right, so a handful of homes sell for significantly more than the average. Making accurate predictions for these pricey homes may be the most difficult part of building a good predictive model.
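Since the competition metric is logarithmic (more on that in the modeling section), it is also worth glancing at the density of the log of SalePrice, which should look much closer to normal than the raw variable:
plot(density(log(train$SalePrice)), main = "log(SalePrice)")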
Predictive Modeling
Before jumping into modeling, we should determine whether we have to alter our data structures to get them to work with our model and whether we want to add new features. We will use the XGBoost tree model for this problem. The XGBoost package in R accepts data in a specific numeric matrix format, so if we were to use it directly, we'd have to one-hot encode all of the categorical variables and put the data into a large numeric matrix. To make things easier, we will use R's caret package interface to XGBoost, which lets us use our current data unaltered.
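For reference, the direct xgboost route would look roughly like this sketch (not run here, and assuming the xgboost package is loaded): dummy-encode the factor variables with model.matrix and wrap the result in an xgb.DMatrix along with the labels.
train_matrix = model.matrix(SalePrice ~ . - 1, data = train)  # Dummy-encode factor columns
dtrain = xgb.DMatrix(data = train_matrix, label = train$SalePrice)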
This data set already has a large number of features, so adding more may not do much to improve the model, but upon inspecting the variable description file, I noticed a couple of key variables I expected to see were missing. Namely, total square footage and total number of bathrooms are common features used to describe homes, but in this data set they are split into separate pieces, such as above grade square footage, basement square footage and so on. Let's add two new features for total square footage and total bathrooms:
In [13]:
# Add variable that combines above grade living area with basement sq footage
train$total_sq_footage = train$GrLivArea + train$TotalBsmtSF
test$total_sq_footage = test$GrLivArea + test$TotalBsmtSF
# Add variable that combines above ground and basement full and half baths
train$total_baths = train$BsmtFullBath + train$FullBath + (0.5 * (train$BsmtHalfBath + train$HalfBath))
test$total_baths = test$BsmtFullBath + test$FullBath + (0.5 * (test$BsmtHalfBath + test$HalfBath))
# Remove Id since it should have no value in prediction
train$Id = NULL
test$Id = NULL
Now we are ready to create a predictive model. Let's start by loading some packages:
In [14]:
library(caret)
library(plyr)
library(xgboost)
library(Metrics)
Next, let's create the control object and tuning parameter grid we need to pass to our caret model. The target metric used to judge this competition is root mean squared logarithmic error, or RMSLE. Caret optimizes root mean squared error for regression by default, so if we want to optimize for RMSLE we should pass in a custom summary function via our caret control object. The R package "Metrics" has a function for computing RMSLE, so we can use it to compute the performance metric inside our custom summary function.
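For reference, RMSLE is just root mean squared error computed on log(1 + x) values, so for non-negative predictions a hand-rolled version equivalent to Metrics::rmsle would be:
rmsle_by_hand = function(actual, predicted){
    sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}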
In [15]:
# Create custom summary function in proper format for caret
custom_summary = function(data, lev = NULL, model = NULL){
    out = rmsle(data[, "obs"], data[, "pred"])
    names(out) = c("rmsle")
    out
}
# Create control object
control = trainControl(method = "cv",    # Use cross-validation
                       number = 5,       # with 5 folds
                       summaryFunction = custom_summary)
# Create grid of tuning parameters
grid = expand.grid(nrounds = c(100, 200, 400, 800),  # Test 4 values for boosting rounds
                   max_depth = c(4, 6),              # Test 2 values for tree depth
                   eta = c(0.1, 0.05, 0.025),        # Test 3 values for learning rate
                   gamma = c(0.1),
                   colsample_bytree = c(1),
                   min_child_weight = c(1))
Now we can train our model, using the custom metric rmsle:
In [16]:
set.seed(12)
xgb_tree_model = train(SalePrice~.,         # Predict SalePrice using all features
                       data = train,
                       method = "xgbTree",
                       trControl = control,
                       tuneGrid = grid,
                       metric = "rmsle",    # Use custom performance metric
                       maximize = FALSE)    # Minimize the metric
Next, let's check the results of training and see which tuning parameters were selected:
In [17]:
xgb_tree_model$results
xgb_tree_model$bestTune
In this case, the model with a tree depth of 4, trained for 800 rounds with a learning rate of 0.025, was chosen. According to the table, the cross-validated rmsle for this model was 0.12717267, so we'd expect a score close to 0.127 if we were to use this model to make predictions on the test set. Before we make predictions, let's check which variables ended up being most important to the model:
In [18]:
varImp(xgb_tree_model)
As expected, the variable with the highest correlation to SalePrice, OverallQual, was very important to the model. The extra feature for total square footage was also very important, and our total bathrooms variable came in a distant third.
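Plotting the importance scores (an optional extra step) makes the drop-off after the top few variables easier to see:
plot(varImp(xgb_tree_model), top = 20)  # Plot the 20 most important variables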
Finally, let's make predictions on the test set using the trained model and submit them to Kaggle to see whether the actual performance is close to our cross-validation estimate:
In [19]:
test_predictions = predict(xgb_tree_model, newdata=test)
submission = read.csv("sample_submission.csv")
submission$SalePrice = test_predictions
write.csv(submission, "home_prices_xgb_sub1.csv", row.names=FALSE)
Submitting the predictions file yields a test set RMSLE of 0.12678, so our estimate was pretty close to the model's true performance.
We will stop here, but there are many ways to iterate on this initial solution to try to get a better score, such as tuning the parameters further or aggregating several models together. Maybe there are more features we could add to improve the model. Maybe a different, simpler model would work better on this problem. Once you have an initial solution in hand, it is usually easy to alter it slightly and generate new predictions. The hardest part is getting started.