Created December 7, 2013 04:56
A late-night quick rundown of some different classifiers on the UCI starcraft dataset.
{
"metadata": {
"name": "Untitled9"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Starcraft data analysis #\n\nLet's see how well we can do on this starcraft dataset\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "%pylab inline\nX = loadtxt('skill-clean.csv', delimiter=',')\nprint X.shape\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Populating the interactive namespace from numpy and matplotlib\n(3338L, 20L)"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "What did it do?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print X[0]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[ 6.10000000e+01 1.00000000e+00 2.10000000e+01 8.00000000e+00\n 2.40000000e+02 4.69962000e+01 8.20114000e-04 1.68517000e-04\n 6.00000000e+00 0.00000000e+00 4.49000000e-05 1.98849600e-03\n 9.40227000e+01 9.05311000e+01 4.10170000e+00 1.50000000e+01\n 5.72960000e-04 5.00000000e+00 0.00000000e+00 0.00000000e+00]\n"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It floatified everything. Let's get our target out of there."
},
{
"cell_type": "code",
"collapsed": false,
"input": "y = array(X[:,1],dtype='int32')\nprint y.shape, y[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(3338L,) [1 1 1 1 1 1 1 1 1 1]\n"
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's remove the target and index from each record."
},
{
"cell_type": "code",
"collapsed": false,
"input": "X = X[:,2:]\nprint X.shape, X[0,:]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(3338L, 18L) [ 2.10000000e+01 8.00000000e+00 2.40000000e+02 4.69962000e+01\n 8.20114000e-04 1.68517000e-04 6.00000000e+00 0.00000000e+00\n 4.49000000e-05 1.98849600e-03 9.40227000e+01 9.05311000e+01\n 4.10170000e+00 1.50000000e+01 5.72960000e-04 5.00000000e+00\n 0.00000000e+00 0.00000000e+00]\n"
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Time to split into training and test."
},
{
"cell_type": "code",
"collapsed": false,
"input": "split = X.shape[0] * 0.8\ninds = permutation(range(X.shape[0]))\nX = X[inds,:]\ny = y[inds]\nXtrn, Xtst = X[:split,:], X[split:,:]\nytrn, ytst = y[:split], y[split:]\nprint Xtrn.shape, Xtst.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(2670L, 18L) (668L, 18L)\n"
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's do some scaling by feature."
},
{
"cell_type": "code",
"collapsed": false,
"input": "mean, upper, lower = Xtrn.mean(axis=0), Xtrn.max(axis=0), Xtrn.min(axis=0)\nXtrn = (Xtrn-mean)/(upper-lower)\nXtst = (Xtst-mean)/(upper-lower)\nprint Xtrn.shape, Xtst.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(2670L, 18L) (668L, 18L)\n"
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Ok, let's throw some classifiers and regressors at it."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import sklearn\nfrom sklearn.svm import SVC\n\nsvm_model = SVC(kernel='rbf', C=0.5)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.390718562874\n"
}
],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Wow, pretty horrible."
},
{
"cell_type": "code",
"collapsed": false,
"input": "svm_model = SVC(kernel='rbf', C=10.)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.411676646707\n"
}
],
"prompt_number": 200
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not a big improvement. Maybe if we tell it that we're doing regression..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.svm import SVR\nsvm_model = SVR(kernel='rbf', C=10.)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.569161678222\n"
}
],
"prompt_number": 211
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That's the $R^2$ value, so it's not the same as classification. Let's see how well it classifies..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(svm_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.422155688623\n"
}
],
"prompt_number": 212
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not a great improvement. Are we overfitting a lot?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print 'in sample', svm_model.score(Xtrn,ytrn)\nprint 'out of sample', svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "in sample "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.546991652417\nout of sample "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.569161678222\n"
}
],
"prompt_number": 213
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Maybe this is just a tough problem. Let throw some tree models at it and then call it quits." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import GradientBoostingRegressor as GBR\ngb_model = GBR()\ngb_model.fit(Xtrn, ytrn)\nprint gb_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.632937458544\n"
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Hey there! Let's see what the classification accuracy is"
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(gb_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.417664670659\n"
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Well that's disappointing, I was expecting more. Let's try tuning the parameters a little."
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb_model = GBR(n_estimators=100, max_depth=3, min_samples_split=100, min_samples_leaf=100,\n subsample=0.5, learning_rate=0.1)\ngb_model.fit(Xtrn, ytrn)\nprint gb_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.645185998476\n"
}
],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(gb_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.440119760479\n"
}
],
"prompt_number": 107
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not bad, it helped a little. How about a random forest"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import RandomForestRegressor as RFR\nrf_model = RFR(n_estimators=100, max_features='sqrt', max_depth=3,\n min_samples_split=10, min_samples_leaf=10, n_jobs=-1)\n\nrf_model.fit(Xtrn, ytrn)\nprint rf_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.554587772605\n"
}
],
"prompt_number": 108
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And I'll do a little parameter tuning."
},
{
"cell_type": "code",
"collapsed": false,
"input": "rf_model = RFR(n_estimators=200, max_features='auto', max_depth=15,\n min_samples_split=1, min_samples_leaf=1, n_jobs=-1)\n\nrf_model.fit(Xtrn, ytrn)\nprint rf_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.623351946439\n"
}
],
"prompt_number": 132
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(rf_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.422155688623\n"
}
],
"prompt_number": 133
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not able to be the GBM. Let's try with extremely randomized trees" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import ExtraTreesRegressor as XR\nxr_model = XR(n_estimators=10, max_depth=5, min_samples_split=2,\n min_samples_leaf=2, bootstrap=False, n_jobs=-1)\n\nxr_model.fit(Xtrn, ytrn)\nprint xr_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.559458645933\n"
}
],
"prompt_number": 135
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And a little parameter tuning"
},
{
"cell_type": "code",
"collapsed": false,
"input": "xr_model = XR(n_estimators=100, max_depth=20, min_samples_split=1,\n min_samples_leaf=10, bootstrap=False, n_jobs=-1)\n\nxr_model.fit(Xtrn, ytrn)\nprint xr_model.score(Xtst, ytst), xr_model.score(Xtrn, ytrn)\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.613048652136 "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.714282553314\n"
}
],
"prompt_number": 154
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(xr_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.417664670659\n"
}
],
"prompt_number": 155
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not so great. One last shot - let's try KNN"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.neighbors import KNeighborsClassifier as KNC, KNeighborsRegressor as KNR\n\n#knn classifier\nknc = KNC(n_neighbors=101, weights='distance')\nknc.fit(Xtrn, ytrn)\nprint knc.score(Xtst, ytst)\n\n#knn regressor\nknr = KNR(n_neighbors=101, weights='distance')\nknr.fit(Xtrn, ytrn)\nyhat = around(knr.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.40119760479\n0.384730538922"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 182
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It's interesting that the classifier does better than the regressor here.\n\n## Summary ##\n\nGradient boosting methods performed the best, followed by support vector regression and random forests. This is a fairly hard prediction problem, the best raw accuracy metric (which should't be used on this problem, but sue me) was < 50%." | |
}
],
"metadata": {}
}
]
}
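For anyone replaying this notebook against a current scikit-learn, the overall pipeline (an 80/20 split, per-feature scaling computed from the training set only, a gradient-boosted regressor, and rounding the regression output back to integer league labels to get an accuracy) can be sketched as below. The gist doesn't include `skill-clean.csv`, so this sketch substitutes synthetic data of the same shape; the column layout (index, league label 1–8, then 18 gameplay features) and the hyperparameters are taken from the notebook, and the `%pylab` globals are replaced with explicit `numpy` calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for loadtxt('skill-clean.csv', ...): 3338 rows,
# 18 features after dropping the index and target columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(3338, 18))
y = rng.integers(1, 9, size=3338)  # league labels, like the notebook's target

# 80/20 split, replacing the manual permutation + slice.
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.2, random_state=0)

# Min-max-style scaling using training-set statistics only, as the notebook does.
mean = Xtrn.mean(axis=0)
span = Xtrn.max(axis=0) - Xtrn.min(axis=0)
Xtrn = (Xtrn - mean) / span
Xtst = (Xtst - mean) / span

# The notebook's tuned GBR settings (minus the min_samples_* constraints).
gb = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                               subsample=0.5, learning_rate=0.1,
                               random_state=0)
gb.fit(Xtrn, ytrn)

# Round the regression output to the nearest league label for an accuracy.
yhat = np.around(gb.predict(Xtst))
acc = np.mean(yhat == ytst)
print(acc)
```

On the real dataset this reproduces the ~0.44 rounded-prediction accuracy regime; on the synthetic stand-in the number is meaningless and only the mechanics carry over.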
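The hand-tuning above (note the prompt numbers jumping from 53 to 106, 132, and 154) can also be done with a cross-validated grid search instead of re-running cells. A minimal sketch, using an illustrative parameter grid and small synthetic data rather than the values actually tried in the notebook:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic stand-in for the scaled training features and league labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 18))
y = rng.integers(1, 9, size=300).astype(float)

# Illustrative grid; the notebook's manual sweep touched these same knobs.
grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "subsample": [0.5, 1.0],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=3)
search.fit(X, y)

# best_score_ is the mean cross-validated R^2, the same metric .score() reports.
print(search.best_params_, search.best_score_)
```

This also guards against the quiet overfitting check done by hand above, since every candidate is scored out-of-fold.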