The competition tasks entrants with classifying poker hands from a sample of 25,010 labeled hands. Each hand is described by 10 features giving the suit and rank of its five cards, and the cards appear in no particular order. The difficulty is learning the rules without assuming anything about them or using domain knowledge to guide learning. It would of course be easy to sort the cards, compute combinations, or hard-code poker rules to achieve good results, but the challenge states that you are to act as though you are an alien who knows nothing about poker or card games: the methods should be general enough to apply to completely different card games with different rules. Several entrants are scoring 99% or above in accuracy, which suggests they are injecting domain knowledge into their solutions. My aim is to run a few basic machine learning algorithms on the raw data, without making any assumptions, and beat the competition's random forest benchmark accuracy of 0.62408.
I start by loading the machine learning libraries.
library(caret)
library(randomForest)
library(gbm)
options( java.parameters = "-Xmx3g" )  # give the JVM more heap; must be set before loading extraTrees, which uses rJava
library(extraTrees)
Next I read in the training and test data.
train = read.csv("train.csv")
test = read.csv("test.csv")
# Get rid of the ID column
test = test[,2:11]
# Separate labels from the training set
labels = as.factor(train$hand)
train = train[,1:10]
# Split the training set into a partial training set and a validation set
part_train = train[1:18000,]
valid = train[-1:-18000,]
labels_part = labels[1:18000]
valid_labels = labels[-1:-18000]
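Before modeling, it is worth glancing at the class balance. Poker hand classes are heavily skewed toward high card (0) and one pair (1), which together make up roughly 90% of hands, and that skew explains the shape of the confusion matrices below. A quick check, using the labels already loaded above:

```r
# Proportion of each hand class in the training labels.
# Classes 0 and 1 dominate; straights, flushes and rarer
# hands are a tiny fraction of the data.
round(table(labels) / length(labels), 4)
```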
First I fit a random forest model.
set.seed(12)
tree = randomForest(labels_part~., data=part_train, nodesize=1, ntree=500, mtry=4)
tree_pred = predict(tree, newdata=valid, type="class")
confusionMatrix(tree_pred,valid_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 2819 1465 78 22 0 12 0 1 0 0
## 1 699 1499 262 107 24 2 11 1 4 1
## 2 0 0 0 2 0 0 0 0 0 0
## 3 0 0 0 1 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.616
## 95% CI : (0.605, 0.628)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : <2e-16
##
The validation accuracy of 0.6161 is slightly below the benchmark.
Next I try an extremely randomized trees (extraTrees) classifier.
set.seed(12)
xtrees = extraTrees(y=labels_part, x=part_train, nodesize=1, ntree=500, mtry=4, numRandomCuts = 3)
xtrees_pred = predict(xtrees, newdata=valid)
confusionMatrix(xtrees_pred,valid_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 2749 1488 80 23 2 12 1 1 0 0
## 1 768 1470 260 106 22 2 10 1 4 1
## 2 1 4 0 1 0 0 0 0 0 0
## 3 0 2 0 2 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.602
## 95% CI : (0.591, 0.614)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : <2e-16
##
The validation accuracy of 0.6036 is again a bit below the benchmark. Reaching the benchmark is likely a matter of parameter tuning.
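One sketch of such tuning, using caret to cross-validate mtry for the random forest; the grid values here are illustrative guesses, not settings I have actually run:

```r
# 5-fold cross-validation over a small mtry grid.
# ntree is passed through to the underlying randomForest call.
ctrl = trainControl(method = "cv", number = 5)
grid = expand.grid(mtry = c(2, 4, 6, 8))
rf_tuned = train(x = part_train, y = labels_part, method = "rf",
                 trControl = ctrl, tuneGrid = grid, ntree = 500)
rf_tuned$bestTune  # mtry value with the best CV accuracy
```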
Finally, I try a gradient boosted machine (GBM) model.
set.seed(12)
tunecontrol = trainControl(method = "none")
tgrid = expand.grid(n.trees = c(100), interaction.depth=c(15) ,shrinkage=c(0.107) )
gbm_mod = train(labels_part~., data=part_train, method="gbm", trControl=tunecontrol, tuneGrid=tgrid)
pred_gbm = predict(gbm_mod, newdata=valid)
confusionMatrix(pred_gbm ,valid_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 2932 1325 55 18 1 13 0 0 1 0
## 1 585 1617 277 111 23 1 10 1 2 1
## 2 1 13 5 2 0 0 0 0 0 0
## 3 0 6 3 1 0 0 1 1 0 0
## 4 0 2 0 0 0 0 0 0 1 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0
## 9 0 1 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.65
## 95% CI : (0.638, 0.661)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : <2e-16
##
The GBM validation accuracy of 0.6498 beats the benchmark. Increasing n.trees to 2000 raises the validation accuracy to 0.7692.
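A sketch of that larger run is below. Note that newer versions of caret also require n.minobsinnode in the gbm tuning grid; the value of 10 here is gbm's default and an assumption on my part, not a tuned setting.

```r
set.seed(12)
# Same model as above, but with 2000 boosting iterations.
tgrid_big = expand.grid(n.trees = 2000, interaction.depth = 15,
                        shrinkage = 0.107, n.minobsinnode = 10)
gbm_big = train(labels_part~., data = part_train, method = "gbm",
                trControl = tunecontrol, tuneGrid = tgrid_big, verbose = FALSE)
confusionMatrix(predict(gbm_big, newdata = valid), valid_labels)
```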