A late-night quick rundown of some different classifiers on the UCI starcraft dataset.
{
"metadata": {
"name": "Untitled9"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Starcraft data analysis #\n\nLet's see how well we can do on this starcraft dataset\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "%pylab inline\nX = loadtxt('skill-clean.csv', delimiter=',')\nprint X.shape\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Populating the interactive namespace from numpy and matplotlib\n(3338L, 20L)"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "What did it do?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print X[0]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[ 6.10000000e+01 1.00000000e+00 2.10000000e+01 8.00000000e+00\n 2.40000000e+02 4.69962000e+01 8.20114000e-04 1.68517000e-04\n 6.00000000e+00 0.00000000e+00 4.49000000e-05 1.98849600e-03\n 9.40227000e+01 9.05311000e+01 4.10170000e+00 1.50000000e+01\n 5.72960000e-04 5.00000000e+00 0.00000000e+00 0.00000000e+00]\n"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It floatified everything. Let's get our target out of there."
},
{
"cell_type": "code",
"collapsed": false,
"input": "y = array(X[:,1],dtype='int32')\nprint y.shape, y[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(3338L,) [1 1 1 1 1 1 1 1 1 1]\n"
}
],
"prompt_number": 31
},
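{
"cell_type": "markdown",
"metadata": {},
"source": "Since we'll be quoting raw accuracy later, it's worth a quick look at the class balance first. A minimal sketch using `bincount` (not run here, so no output):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# how many players fall into each league (class label)\ncounts = bincount(y)\nfor label in range(len(counts)):\n    if counts[label] > 0:\n        print 'class %d: %d players' % (label, counts[label])",
"language": "python",
"metadata": {},
"outputs": []
},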
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's remove the target and index from each record."
},
{
"cell_type": "code",
"collapsed": false,
"input": "X = X[:,2:]\nprint X.shape, X[0,:]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(3338L, 18L) [ 2.10000000e+01 8.00000000e+00 2.40000000e+02 4.69962000e+01\n 8.20114000e-04 1.68517000e-04 6.00000000e+00 0.00000000e+00\n 4.49000000e-05 1.98849600e-03 9.40227000e+01 9.05311000e+01\n 4.10170000e+00 1.50000000e+01 5.72960000e-04 5.00000000e+00\n 0.00000000e+00 0.00000000e+00]\n"
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Time to split into training and test."
},
{
"cell_type": "code",
"collapsed": false,
"input": "split = X.shape[0] * 0.8\ninds = permutation(range(X.shape[0]))\nX = X[inds,:]\ny = y[inds]\nXtrn, Xtst = X[:split,:], X[split:,:]\nytrn, ytst = y[:split], y[split:]\nprint Xtrn.shape, Xtst.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(2670L, 18L) (668L, 18L)\n"
}
],
"prompt_number": 33
},
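{
"cell_type": "markdown",
"metadata": {},
"source": "(Aside: `train_test_split` from `sklearn.cross_validation` does the shuffle-and-split in one call. A sketch of the equivalent; not run here.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.cross_validation import train_test_split\n\n# the same shuffled 80/20 split in one line\nXtrn2, Xtst2, ytrn2, ytst2 = train_test_split(X, y, test_size=0.2)\nprint Xtrn2.shape, Xtst2.shape",
"language": "python",
"metadata": {},
"outputs": []
},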
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's do some scaling by feature."
},
{
"cell_type": "code",
"collapsed": false,
"input": "mean, upper, lower = Xtrn.mean(axis=0), Xtrn.max(axis=0), Xtrn.min(axis=0)\nXtrn = (Xtrn-mean)/(upper-lower)\nXtst = (Xtst-mean)/(upper-lower)\nprint Xtrn.shape, Xtst.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(2670L, 18L) (668L, 18L)\n"
}
],
"prompt_number": 41
},
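{
"cell_type": "markdown",
"metadata": {},
"source": "(Aside: `sklearn.preprocessing` has scalers that handle this fit-on-train, apply-to-both pattern. A sketch with `StandardScaler`, which standardizes to zero mean and unit variance rather than the range scaling above; not run here.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.preprocessing import StandardScaler\n\n# fit the scaling statistics on the training set only, then apply to both sets\nscaler = StandardScaler().fit(Xtrn)\nXtrn_std = scaler.transform(Xtrn)\nXtst_std = scaler.transform(Xtst)\nprint Xtrn_std.shape, Xtst_std.shape",
"language": "python",
"metadata": {},
"outputs": []
},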
{
"cell_type": "markdown",
"metadata": {},
"source": "Ok, let's throw some classifiers and regressors at it."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import sklearn\nfrom sklearn.svm import SVC\n\nsvm_model = SVC(kernel='rbf', C=0.5)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.390718562874\n"
}
],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Wow, pretty horrible."
},
{
"cell_type": "code",
"collapsed": false,
"input": "svm_model = SVC(kernel='rbf', C=10.)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.411676646707\n"
}
],
"prompt_number": 200
},
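{
"cell_type": "markdown",
"metadata": {},
"source": "(Aside: rather than hand-picking `C`, a small grid search over `C` and `gamma` would be more systematic. A sketch with `GridSearchCV`; the grid values are illustrative and this wasn't run.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.grid_search import GridSearchCV\n\n# 3-fold cross-validated search over a small, illustrative grid\nparams = {'C': [0.1, 1., 10., 100.], 'gamma': [0.001, 0.01, 0.1]}\ngrid = GridSearchCV(SVC(kernel='rbf'), params, cv=3)\ngrid.fit(Xtrn, ytrn)\nprint grid.best_params_, grid.best_score_",
"language": "python",
"metadata": {},
"outputs": []
},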
{
"cell_type": "markdown",
"metadata": {},
"source": "Not a big improvement. Maybe if we tell it that we're doing regression..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.svm import SVR\nsvm_model = SVR(kernel='rbf', C=10.)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.569161678222\n"
}
],
"prompt_number": 211
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That's the $R^2$ value, so it's not the same as classification. Let's see how well it classifies..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(svm_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.422155688623\n"
}
],
"prompt_number": 212
},
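{
"cell_type": "markdown",
"metadata": {},
"source": "We'll be repeating this round-then-score step for every regressor below, so here's a small helper that wraps it up (a sketch; `round_acc` is a hypothetical name, not used in the cells that follow):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def round_acc(model, X, y):\n    # round regression outputs to the nearest integer label and score them\n    yhat = around(model.predict(X), decimals=0)\n    return sum(equal(yhat, y)) / float(yhat.shape[0])\n\nprint round_acc(svm_model, Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": []
},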
{
"cell_type": "markdown",
"metadata": {},
"source": "Not a great improvement. Are we overfitting a lot?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print 'in sample', svm_model.score(Xtrn,ytrn)\nprint 'out of sample', svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "in sample "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.546991652417\nout of sample "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.569161678222\n"
}
],
"prompt_number": 213
},
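{
"cell_type": "markdown",
"metadata": {},
"source": "The test $R^2$ actually comes out above the training $R^2$, so overfitting isn't the problem; a single 80/20 split is just noisy. Cross-validation, with the scaling refit inside each fold, would give a steadier estimate. A sketch using a `Pipeline` (not run here):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.cross_validation import cross_val_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\n\n# scale inside each fold so no test-fold statistics leak into training\npipe = Pipeline([('scale', StandardScaler()), ('svr', SVR(kernel='rbf', C=10.))])\nscores = cross_val_score(pipe, X, y, cv=5)\nprint scores.mean(), scores.std()",
"language": "python",
"metadata": {},
"outputs": []
},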
{
"cell_type": "markdown",
"metadata": {},
"source": "Maybe this is just a tough problem. Let throw some tree models at it and then call it quits."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import GradientBoostingRegressor as GBR\ngb_model = GBR()\ngb_model.fit(Xtrn, ytrn)\nprint gb_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.632937458544\n"
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Hey there! Let's see what the classification accuracy is"
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(gb_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.417664670659\n"
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Well that's disappointing, I was expecting more. Let's try tuning the parameters a little."
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb_model = GBR(n_estimators=100, max_depth=3, min_samples_split=100, min_samples_leaf=100,\n subsample=0.5, learning_rate=0.1)\ngb_model.fit(Xtrn, ytrn)\nprint gb_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.645185998476\n"
}
],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(gb_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.440119760479\n"
}
],
"prompt_number": 107
},
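{
"cell_type": "markdown",
"metadata": {},
"source": "Before moving on: gradient boosting gives us feature importances for free, which hints at what's driving the predictions. A sketch (the column indices refer to the trimmed 18-column matrix; not run here):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# the five columns the boosted trees lean on most\nimps = gb_model.feature_importances_\nfor i in argsort(imps)[::-1][:5]:\n    print 'column %d: importance %.3f' % (i, imps[i])",
"language": "python",
"metadata": {},
"outputs": []
},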
{
"cell_type": "markdown",
"metadata": {},
"source": "Not bad, it helped a little. How about a random forest"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import RandomForestRegressor as RFR\nrf_model = RFR(n_estimators=100, max_features='sqrt', max_depth=3,\n min_samples_split=10, min_samples_leaf=10, n_jobs=-1)\n\nrf_model.fit(Xtrn, ytrn)\nprint rf_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.554587772605\n"
}
],
"prompt_number": 108
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And I'll do a little parameter tuning."
},
{
"cell_type": "code",
"collapsed": false,
"input": "rf_model = RFR(n_estimators=200, max_features='auto', max_depth=15,\n min_samples_split=1, min_samples_leaf=1, n_jobs=-1)\n\nrf_model.fit(Xtrn, ytrn)\nprint rf_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.623351946439\n"
}
],
"prompt_number": 132
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(rf_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.422155688623\n"
}
],
"prompt_number": 133
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not able to be the GBM. Let's try with extremely randomized trees"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import ExtraTreesRegressor as XR\nxr_model = XR(n_estimators=10, max_depth=5, min_samples_split=2,\n min_samples_leaf=2, bootstrap=False, n_jobs=-1)\n\nxr_model.fit(Xtrn, ytrn)\nprint xr_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.559458645933\n"
}
],
"prompt_number": 135
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And a little parameter tuning"
},
{
"cell_type": "code",
"collapsed": false,
"input": "xr_model = XR(n_estimators=100, max_depth=20, min_samples_split=1,\n min_samples_leaf=10, bootstrap=False, n_jobs=-1)\n\nxr_model.fit(Xtrn, ytrn)\nprint xr_model.score(Xtst, ytst), xr_model.score(Xtrn, ytrn)\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.613048652136 "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.714282553314\n"
}
],
"prompt_number": 154
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(xr_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.417664670659\n"
}
],
"prompt_number": 155
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not so great. One last shot - let's try KNN"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.neighbors import KNeighborsClassifier as KNC, KNeighborsRegressor as KNR\n\n#knn classifier\nknc = KNC(n_neighbors=101, weights='distance')\nknc.fit(Xtrn, ytrn)\nprint knc.score(Xtst, ytst)\n\n#knn regressor\nknr = KNR(n_neighbors=101, weights='distance')\nknr.fit(Xtrn, ytrn)\nyhat = around(knr.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.40119760479\n0.384730538922"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 182
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It's interesting that the classifier does better than the regressor here.\n\n## Summary ##\n\nGradient boosting methods performed the best, followed by support vector regression and random forests. This is a fairly hard prediction problem, the best raw accuracy metric (which should't be used on this problem, but sue me) was < 50%."
},
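{
"cell_type": "markdown",
"metadata": {},
"source": "For completeness, here's one way to rerun the regressor head-to-head in a single loop, scoring both $R^2$ and rounded accuracy (a sketch reusing the imports and tuned parameters from above; not run here):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "models = [('SVR', SVR(kernel='rbf', C=10.)),\n          ('GBR', GBR(n_estimators=100, max_depth=3, min_samples_split=100,\n                      min_samples_leaf=100, subsample=0.5, learning_rate=0.1)),\n          ('RFR', RFR(n_estimators=200, max_depth=15, n_jobs=-1)),\n          ('XR', XR(n_estimators=100, max_depth=20, min_samples_leaf=10, n_jobs=-1))]\nfor name, model in models:\n    model.fit(Xtrn, ytrn)\n    # R^2 on the held-out set, plus accuracy after rounding to the nearest label\n    yhat = around(model.predict(Xtst), decimals=0)\n    acc = sum(equal(yhat, ytst)) / float(yhat.shape[0])\n    print '%s: R^2 = %.3f, acc = %.3f' % (name, model.score(Xtst, ytst), acc)",
"language": "python",
"metadata": {},
"outputs": []
}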
],
"metadata": {}
}
]
}