Monday, December 15, 2014

Kaggle: Poker Rule Induction


Since there's going to be a lull in new MOOCs for the foreseeable future, I'm going to work on a few more Kaggle competitions. Kaggle put a poker rule induction problem online a couple of weeks ago. The following is my R Markdown write-up of my first attempt to beat the benchmark.


The competition tasks entrants with classifying poker hands based on a sample of 25,010 labeled hands. The data is separated into 10 features indicating the suit and rank of each of the five cards; the cards are not given in any particular order, and the labels run from class 0 ("nothing in hand") up to class 9 (royal flush). The difficulty is learning the rules without assuming anything about them or using domain knowledge to guide learning. It would of course be easy to sort the cards, calculate combinations, or hard-code poker rules to achieve good results, but the challenge states that you are to act as though you are an alien that doesn't know anything about poker or card games: the methods used are supposed to be general enough to apply to completely different card games with different rules. Several entrants are scoring 99% or above in accuracy, which suggests that people are injecting domain knowledge into their solutions. My aim is to run a few basic machine learning algorithms on the raw data without making any assumptions and beat the competition's random forest benchmark accuracy of 0.62408.


I start by loading some machine learning libraries.
library(caret)
library(randomForest)
library(gbm)
# Give the JVM a 3GB heap; this must be set before extraTrees loads rJava
options(java.parameters = "-Xmx3g")
library(extraTrees)


Next I read in the training and test data, separate the labels, and carve out a validation set.
train = read.csv("train.csv")
test = read.csv("test.csv")
#Get rid of ID column
test = test[,2:11]

#Separate labels from training set
labels = as.factor(train$hand)
train = train[,1:10]

#Split training set into partial training set and validation set
part_train = train[1:18000,]
valid = train[-1:-18000,]

labels_part = labels[1:18000]
valid_labels = labels[-1:-18000]
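
Poker hands are heavily imbalanced: roughly half of all hands are class 0 ("nothing"), which is why the no-information rate reported in the confusion matrices below sits around 0.502. A quick sanity check on the split (a sketch, not part of the original modeling code):

# Class proportions in the partial training and validation sets
round(prop.table(table(labels_part)), 3)
round(prop.table(table(valid_labels)), 3)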


First I run a random forest model.
set.seed(12)
tree = randomForest(labels_part~., data=part_train, nodesize=1, ntree=500, mtry=4)

tree_pred = predict(tree, newdata=valid, type="class")

confusionMatrix(tree_pred,valid_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 2819 1465   78   22    0   12    0    1    0    0
##          1  699 1499  262  107   24    2   11    1    4    1
##          2    0    0    0    2    0    0    0    0    0    0
##          3    0    0    0    1    0    0    0    0    0    0
##          4    0    0    0    0    0    0    0    0    0    0
##          5    0    0    0    0    0    0    0    0    0    0
##          6    0    0    0    0    0    0    0    0    0    0
##          7    0    0    0    0    0    0    0    0    0    0
##          8    0    0    0    0    0    0    0    0    0    0
##          9    0    0    0    0    0    0    0    0    0    0
## 
## Overall Statistics
##                                         
##                Accuracy : 0.616         
##                  95% CI : (0.605, 0.628)
##     No Information Rate : 0.502         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
Validation accuracy comes out at 0.6161, slightly lower than the benchmark. Note that the model predicts almost nothing beyond classes 0 and 1.


Next I try an extra trees classifier.
set.seed(12)
xtrees = extraTrees(y=labels_part, x=part_train, nodesize=1, ntree=500, mtry=4, numRandomCuts = 3)

xtrees_pred = predict(xtrees, newdata=valid)

confusionMatrix(xtrees_pred,valid_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 2749 1488   80   23    2   12    1    1    0    0
##          1  768 1470  260  106   22    2   10    1    4    1
##          2    1    4    0    1    0    0    0    0    0    0
##          3    0    2    0    2    0    0    0    0    0    0
##          4    0    0    0    0    0    0    0    0    0    0
##          5    0    0    0    0    0    0    0    0    0    0
##          6    0    0    0    0    0    0    0    0    0    0
##          7    0    0    0    0    0    0    0    0    0    0
##          8    0    0    0    0    0    0    0    0    0    0
##          9    0    0    0    0    0    0    0    0    0    0
## 
## Overall Statistics
##                                         
##                Accuracy : 0.602         
##                  95% CI : (0.591, 0.614)
##     No Information Rate : 0.502         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
Accuracy of 0.6036, again a bit lower than the benchmark. Reaching it is likely a matter of parameter tuning.
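
If I wanted to close that gap, caret can grid-search the forest parameters rather than fixing them by hand. A minimal sketch (the grid and variable names are illustrative, not something I have actually run):

# Hypothetical tuning sketch: 5-fold CV over mtry for the random forest
ctrl = trainControl(method = "cv", number = 5)
rf_grid = expand.grid(mtry = c(2, 4, 6, 8))
rf_tuned = train(x = part_train, y = labels_part, method = "rf",
                 trControl = ctrl, tuneGrid = rf_grid, ntree = 500)
rf_tuned$bestTune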


Finally, I try a GBM model.
set.seed(12)

# Fit a single model with the given parameters (no resampling)
tunecontrol = trainControl(method = "none")

tgrid = expand.grid(n.trees = c(100), interaction.depth = c(15), shrinkage = c(0.107))

gbm_mod = train(labels_part~., data=part_train, method="gbm", trControl=tunecontrol, tuneGrid=tgrid)
pred_gbm = predict(gbm_mod, newdata=valid)
confusionMatrix(pred_gbm ,valid_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 2932 1325   55   18    1   13    0    0    1    0
##          1  585 1617  277  111   23    1   10    1    2    1
##          2    1   13    5    2    0    0    0    0    0    0
##          3    0    6    3    1    0    0    1    1    0    0
##          4    0    2    0    0    0    0    0    0    1    0
##          5    0    0    0    0    0    0    0    0    0    0
##          6    0    0    0    0    0    0    0    0    0    0
##          7    0    0    0    0    0    0    0    0    0    0
##          8    0    0    0    0    0    0    0    0    0    0
##          9    0    1    0    0    0    0    0    0    0    0
## 
## Overall Statistics
##                                         
##                Accuracy : 0.65          
##                  95% CI : (0.638, 0.661)
##     No Information Rate : 0.502         
##     P-Value [Acc > NIR] : <2e-16        
##                                        

The GBM validation accuracy of 0.6498 beats the benchmark. Increasing n.trees to 2000 raises the validation accuracy to 0.7692.
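
The higher-capacity run only changes the tuning grid. A sketch of that configuration (variable names are mine; with newer versions of caret the grid also needs an n.minobsinnode column):

# Same GBM with 2000 boosting iterations; this is the configuration
# that reached 0.7692 validation accuracy
tgrid_2000 = expand.grid(n.trees = c(2000), interaction.depth = c(15), shrinkage = c(0.107))
gbm_mod_2000 = train(labels_part~., data = part_train, method = "gbm",
                     trControl = tunecontrol, tuneGrid = tgrid_2000)
confusionMatrix(predict(gbm_mod_2000, newdata = valid), valid_labels)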


Rerunning the GBM model on the full training set using the same parameters gives a leaderboard score of 0.81873. Since the extra training data significantly improved accuracy, generating new training examples or letting the algorithm run longer(more than 2000 trees) may also improve accuracy. Since I'm currently at my computer's 8GB RAM limit so I'm content to let this competition rest with a single submission.
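
For completeness, the final fit and submission file looked roughly like this (a sketch: the id handling and file name are assumptions, based on the competition's sample submission pairing each test id with a predicted hand):

# Refit on the full training set with the 2000-tree configuration
tgrid_final = expand.grid(n.trees = c(2000), interaction.depth = c(15), shrinkage = c(0.107))
final_mod = train(labels~., data = train, method = "gbm",
                  trControl = tunecontrol, tuneGrid = tgrid_final)

# Recover the id column dropped earlier (assuming it is named "id" in test.csv)
ids = read.csv("test.csv")$id
submission = data.frame(id = ids, hand = predict(final_mod, newdata = test))
write.csv(submission, "submission.csv", row.names = FALSE)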
