Day 4 of the Beat 5 Kaggle Benchmarks in 5 Days challenge
For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.
To beat the all-fir/spruce benchmark, I naturally tried a random forest. With the default settings of scikit-learn's RandomForestClassifier, I beat the benchmark with an accuracy of 0.72718 on the competition leaderboard. Bumping the number of estimators to 100 (versus the default of 10) raised that score to 0.75455.
Using pandas I loaded the train and test data sets into Python. I then used all of the columns as features for the model, which were conveniently all numerical. Here is the Python code for the scikit-learn random forest classifier:
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Load the training and test data sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Create numpy arrays for use with scikit-learn
train_X = train.drop(['Id', 'Cover_Type'], axis=1).values
train_y = train.Cover_Type.values
test_X = test.drop('Id', axis=1).values
# Hold out 20% of the training set as a validation set
X, X_, y, y_ = train_test_split(train_X, train_y, test_size=0.2)
# Train and predict with the random forest classifier
rf = ensemble.RandomForestClassifier()
rf.fit(X,y)
y_rf = rf.predict(X_)
print(metrics.classification_report(y_, y_rf))
print(metrics.accuracy_score(y_, y_rf))
# Retrain with entire training set and predict test set.
rf.fit(train_X,train_y)
y_test_rf = rf.predict(test_X)
# Write to CSV
pd.DataFrame({'Id': test.Id.values, 'Cover_Type': y_test_rf})\
    .to_csv('rf1.csv', index=False, columns=['Id', 'Cover_Type'])
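The 100-estimator model that produced the 0.75455 leaderboard score is the same pipeline with one parameter changed. Here is a minimal self-contained sketch of that comparison on a held-out split; since the Kaggle CSVs aren't bundled, a synthetic 7-class dataset stands in for the cover type data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 7-class cover type data
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=12, n_classes=7,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=0)

# Compare the old scikit-learn default (10 trees) against 100 trees
accs = {}
for n in (10, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X_tr, y_tr)
    accs[n] = accuracy_score(y_val, rf.predict(X_val))
    print('n_estimators=%d: validation accuracy %.4f' % (n, accs[n]))
```

More trees cost roughly proportional training time but usually buy accuracy, since each additional tree reduces the variance of the averaged prediction (note that newer scikit-learn versions already default to n_estimators=100).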