This post outlines the performance of three machine learning algorithms (random forests, K-nearest neighbors, and stochastic gradient boosting) run on custom features extracted from the MNIST data set, a collection of 28x28 pixel handwritten digits. This is a well known data set that is currently provided as part of a Kaggle competition; it has been studied and crunched by thousands of people, and there are neural networks and other techniques out there achieving 99%+ test set accuracy. This write-up details a first attempt made with no foreknowledge of the best known features or methods for this problem. It also serves as an example of how much you can learn from MOOCs: a year ago I didn't know anything about machine learning or the R language.
I begin by loading in some useful R libraries:
library(caret)
library(randomForest)
library(gbm)
library(kknn)
Next I read in the data and separate the labels from the pixel data:
train = read.csv("train.csv")
test= read.csv("test.csv")
labels= as.factor(train[ ,1])
train = train[ ,2:785]
I temporarily combine the training and test sets while creating new features:
train = rbind(train,test)
Next I create some potentially useful meta features for training. Each training case is a vector of 784 pixel intensity values ranging from 0 to 255.
#Total pixel intensity
total_inensity = rowSums(train)
#Number of pixels with non-zero intensity
non_zero = rowSums(train!=0)
#Average pixel intensity
average_inensity = total_inensity/non_zero
#Number of low intensity pixels
between_0_100 = rowSums(train>0 & train<100)
#Proportion of low intensity pixels
prop_between_0_100 = between_0_100/non_zero
I put the new features in a new training data frame:
new_train=data.frame("total_inensity"=total_inensity,"non_zero"=non_zero,"average_inensity"=average_inensity,"between_0_100"=between_0_100,"prop_between_0_100"=prop_between_0_100)
Next, I create new features that attempt to capture some of the structure of the images. I make 28 features for the proportion of an image's total intensity contained in each 1-pixel row and do the same for each 1-pixel column. I then create 16 features based on the intensity in each large patch when splitting the image into a 4x4 grid, and 49 features based on the intensity of each small section when splitting the image into a 7x7 grid. This reduces the total number of features from 784 to 126. My hope is that this will shrink the feature space without throwing out too much useful information.
vertical_seq = seq(1,784,by=28)
#Features for proportion of total pixel intensity in horizontal strips
for (x in 1:28){
  #Pixels for image row x occupy 28 consecutive columns of the data
  row_intensity = rowSums(train[,(1+(28*(x-1))):(28+(28*(x-1)))])
  new_train[paste("h",x,sep="")] = row_intensity/total_inensity
}
#Features for proportion of total pixel intensity in vertical strips
for (x in 1:28){
  #Pixels for image column x are every 28th value, offset by x-1
  new_train[paste("v",x,sep="")] = rowSums(train[,vertical_seq+(x-1)])/total_inensity
}
#Features for proportion of total pixel intensity in large patches (splits the image into a 4x4 grid and finds the proportion of total pixel intensity in each 7x7 pixel section)
for (x in 1:4){
  for (y in 1:4){
    #Collect the 49 pixel indices of the 7x7 pixel patch at grid position (x, y)
    line = 1:7 + ((x-1)*7) + ((y-1)*196)
    for (z in 1:6){
      line = c(line, line[1:7]+(28*z))
    }
    new_train[paste("section",x,y,sep="")] = rowSums(train[line])/total_inensity
  }
}
#Features for proportion of total pixel intensity in small patches (splits the image into a 7x7 grid and finds the proportion of total pixel intensity in each 4x4 pixel section)
for (x in 1:7){
  for (y in 1:7){
    #Collect the 16 pixel indices of the 4x4 pixel patch at grid position (x, y)
    line = 1:4 + ((x-1)*4) + ((y-1)*112)
    for (z in 1:3){
      line = c(line, line[1:4]+(28*z))
    }
    new_train[paste("smallsection",x,y,sep="")] = rowSums(train[line])/total_inensity
  }
}
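As a quick sanity check, the engineered data frame should now contain 5 meta features + 28 row features + 28 column features + 16 large patch features + 49 small patch features = 126 columns; a one-liner along these lines can confirm it:
#Sanity check on the engineered feature count (5 + 28 + 28 + 16 + 49 = 126)
stopifnot(ncol(new_train) == 126)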
Now I separate the testing data from the training data and create a smaller partial training set and validation set.
test = new_train[42001:70000, ]
new_train= new_train[1:42000,]
#Training and validation sets
part_train = new_train[1:30000,]
valid = new_train[30001:42000,]
First, I train a random forest model.
#Random forest model
set.seed(12)
rf_mod1= randomForest(labels[1:30000]~., data=part_train, nodesize=1, ntree=250, mtry=5)
rf_pred = predict(rf_mod1,newdata=valid)
confusionMatrix(rf_pred,labels[30001:42000])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 1175 0 7 1 0 7 7 1 2 4
## 1 0 1330 0 8 2 1 2 0 11 2
## 2 1 6 1125 20 0 4 1 7 0 1
## 3 0 3 7 1178 0 28 0 1 7 16
## 4 1 2 3 1 1091 2 1 4 1 28
## 5 1 0 1 12 0 1008 4 1 11 2
## 6 2 1 5 2 5 5 1146 0 5 0
## 7 0 2 5 15 0 0 0 1232 2 21
## 8 14 6 4 12 3 11 6 2 1130 13
## 9 1 2 0 9 39 10 0 20 5 1126
##
## Overall Statistics
##
## Accuracy : 0.962
## 95% CI : (0.958, 0.965)
## No Information Rate : 0.113
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.957
## Mcnemar's Test P-Value : NA
The confusion matrix shows an overall validation accuracy of 0.962.
Next, a KNN model:
knnpred = kknn(labels[1:30000]~., train=part_train, test=valid, k=3)
kpred = fitted.values(knnpred)
confusionMatrix(kpred,labels[30001:42000])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 1170 0 15 4 2 7 6 1 17 7
## 1 0 1332 3 4 5 2 2 3 12 3
## 2 4 4 1085 25 4 2 6 10 6 1
## 3 1 3 20 1153 3 52 1 6 14 12
## 4 2 1 4 0 1030 2 2 4 4 22
## 5 1 1 0 22 0 942 8 2 15 4
## 6 11 3 7 0 8 22 1136 0 5 0
## 7 0 2 6 14 0 1 0 1200 2 23
## 8 6 5 15 23 3 35 6 2 1088 8
## 9 0 1 2 13 85 11 0 40 11 1133
##
## Overall Statistics
##
## Accuracy : 0.939
## 95% CI : (0.935, 0.943)
## No Information Rate : 0.113
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.932
## Mcnemar's Test P-Value : NA
KNN comes in with a validation accuracy of 0.939.
Finally I train a stochastic gradient boosting model using the caret package.
tunecontrol = trainControl(method = "repeatedcv",
                           number = 2,
                           repeats = 1)
#n.minobsinnode (set to the gbm default of 10) is required in the tuning grid by newer versions of caret
tgrid = expand.grid(n.trees = c(100), interaction.depth = c(7), shrinkage = c(0.107), n.minobsinnode = c(10))
gbm_mod = train(labels[1:30000]~., data=part_train, method= 'gbm', trControl=tunecontrol, tuneGrid=tgrid)
pred_gbm = predict(gbm_mod, newdata=valid)
confusionMatrix(pred_gbm,labels[30001:42000])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 1165 0 6 1 1 5 9 1 2 6
## 1 0 1329 1 4 2 1 2 0 10 5
## 2 1 5 1108 19 1 5 4 12 2 1
## 3 1 4 14 1183 0 30 0 4 10 10
## 4 2 1 4 1 1093 2 2 4 1 14
## 5 1 2 3 15 1 1009 6 1 7 3
## 6 5 1 8 0 6 3 1137 0 4 1
## 7 0 1 5 9 0 2 0 1231 1 20
## 8 19 8 7 18 7 15 7 3 1132 11
## 9 1 1 1 8 29 4 0 12 5 1142
##
## Overall Statistics
##
## Accuracy : 0.961
## 95% CI : (0.957, 0.964)
## No Information Rate : 0.113
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.956
## Mcnemar's Test P-Value : NA
GBM achieves a validation accuracy of 0.961.
As a final step, I combine the 3 classifiers into an ensemble prediction. If 2 or 3 of the classifiers make the same prediction, I choose that prediction; otherwise I defer to the random forest model.
#ifelse on factors returns their underlying integer codes (1-10), so relabel back to the digits 0-9
comb_pred = as.factor(ifelse(pred_gbm==kpred,kpred,rf_pred))
levels(comb_pred) = 0:9
confusionMatrix(comb_pred,labels[30001:42000])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 1177 0 5 2 1 6 7 2 2 6
## 1 0 1333 0 4 2 1 2 0 10 2
## 2 1 5 1124 17 1 3 1 6 1 1
## 3 0 3 9 1193 0 26 0 2 8 12
## 4 1 1 3 1 1092 2 1 3 1 17
## 5 1 0 1 10 0 1015 4 1 5 3
## 6 2 1 6 0 6 4 1146 0 3 0
## 7 0 2 5 14 0 0 0 1238 2 18
## 8 12 6 4 10 3 11 6 1 1137 9
## 9 1 1 0 7 35 8 0 15 5 1145
##
## Overall Statistics
##
## Accuracy : 0.967
## 95% CI : (0.963, 0.97)
## No Information Rate : 0.113
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.963
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9849 0.986 0.9715 0.9483 0.9579 0.9433
## Specificity 0.9971 0.998 0.9967 0.9944 0.9972 0.9977
## Pos Pred Value 0.9743 0.984 0.9690 0.9521 0.9733 0.9760
## Neg Pred Value 0.9983 0.998 0.9970 0.9940 0.9956 0.9944
## Prevalence 0.0996 0.113 0.0964 0.1048 0.0950 0.0897
## Detection Rate 0.0981 0.111 0.0937 0.0994 0.0910 0.0846
## Detection Prevalence 0.1007 0.113 0.0967 0.1044 0.0935 0.0867
## Balanced Accuracy 0.9910 0.992 0.9841 0.9714 0.9776 0.9705
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.9820 0.976 0.9685 0.9439
## Specificity 0.9980 0.996 0.9943 0.9933
## Pos Pred Value 0.9812 0.968 0.9483 0.9408
## Neg Pred Value 0.9981 0.997 0.9966 0.9937
## Prevalence 0.0973 0.106 0.0978 0.1011
## Detection Rate 0.0955 0.103 0.0948 0.0954
## Detection Prevalence 0.0973 0.107 0.0999 0.1014
## Balanced Accuracy 0.9900 0.986 0.9814 0.9686
The ensemble increases validation accuracy to 0.967.
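Incidentally, the ifelse call above relies on the fact that ifelse returns the factors' underlying integer codes, which is why the levels are reassigned afterwards. An equivalent, more explicit sketch of the same voting rule works on character labels directly (the object names here are just for illustration):
#Equivalent voting rule written out on character labels
gbm_chr = as.character(pred_gbm)
knn_chr = as.character(kpred)
rf_chr = as.character(rf_pred)
comb_explicit = factor(ifelse(gbm_chr==knn_chr, knn_chr, rf_chr), levels=as.character(0:9))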
Using the full training data and full test data, I'd expect similar performance. Rerunning the models on the full training set and submitting to Kaggle gives a test set accuracy of 0.96443. While this accuracy is well below what the best known methods achieve, it shows that respectable accuracy is possible with limited domain knowledge, simple models and limited computing power.
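For completeness, the final rerun and submission step follows the same pattern as the validation run: refit each model on all 42,000 engineered training rows, apply the same voting rule to the 28,000 test rows, and write out a submission file. The sketch below assumes the competition's standard ImageId/Label submission format; the object names and file name are just for illustration:
#Sketch of the full-data rerun and Kaggle submission
set.seed(12)
rf_full = randomForest(labels~., data=new_train, nodesize=1, ntree=250, mtry=5)
gbm_full = train(labels~., data=new_train, method='gbm', trControl=tunecontrol, tuneGrid=tgrid)
knn_full = kknn(labels~., train=new_train, test=test, k=3)
rf_test = predict(rf_full, newdata=test)
gbm_test = predict(gbm_full, newdata=test)
knn_test = fitted.values(knn_full)
#Same voting rule as above: take the GBM/KNN prediction when they agree, otherwise the random forest
final = ifelse(as.character(gbm_test)==as.character(knn_test), as.character(knn_test), as.character(rf_test))
submission = data.frame(ImageId=1:nrow(test), Label=final)
write.csv(submission, "submission.csv", row.names=FALSE)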
As a final wrap up, I thought it would be interesting to plot some of the images that the final ensemble classified incorrectly.
plotter = function(img_num){
  #Reshape the 784 pixel values into a 28x28 matrix and plot it in greyscale
  image(t(matrix(as.matrix(train[img_num,]),nrow=28,byrow=TRUE)[28:1,]), axes = FALSE, col = grey(seq(1, 0, length = 256)))
}
first_10_misclassified = head(which(comb_pred!=labels[30001:42000]), 10)
for (error in first_10_misclassified){
  print(paste("Estimate: ",comb_pred[error]))
  print(paste("Actual: ",labels[30001:42000][error]))
  plotter(30000+error)
}
## [1] "Estimate: 5"
## [1] "Actual: 8"
## [1] "Estimate: 7"
## [1] "Actual: 8"
## [1] "Estimate: 9"
## [1] "Actual: 4"
## [1] "Estimate: 3"
## [1] "Actual: 5"
## [1] "Estimate: 3"
## [1] "Actual: 5"
## [1] "Estimate: 7"
## [1] "Actual: 3"
## [1] "Estimate: 4"
## [1] "Actual: 9"
## [1] "Estimate: 7"
## [1] "Actual: 9"
## [1] "Estimate: 4"
## [1] "Actual: 7"
## [1] "Estimate: 2"
## [1] "Actual: 1"
And I thought my handwriting was bad!