A late-night quick rundown of some different classifiers on the UCI starcraft dataset.
{
"metadata": {
"name": "Untitled9"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Starcraft data analysis #\n\nLet's see how well we can do on this starcraft dataset\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "%pylab inline\nX = loadtxt('skill-clean.csv', delimiter=',')\nprint X.shape\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Populating the interactive namespace from numpy and matplotlib\n(3338L, 20L)"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "What did it do?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print X[0]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[ 6.10000000e+01 1.00000000e+00 2.10000000e+01 8.00000000e+00\n 2.40000000e+02 4.69962000e+01 8.20114000e-04 1.68517000e-04\n 6.00000000e+00 0.00000000e+00 4.49000000e-05 1.98849600e-03\n 9.40227000e+01 9.05311000e+01 4.10170000e+00 1.50000000e+01\n 5.72960000e-04 5.00000000e+00 0.00000000e+00 0.00000000e+00]\n"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It floatified everything. Let's get our target out of there."
},
{
"cell_type": "code",
"collapsed": false,
"input": "y = array(X[:,1],dtype='int32')\nprint y.shape, y[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(3338L,) [1 1 1 1 1 1 1 1 1 1]\n"
}
],
"prompt_number": 31
},
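{
"cell_type": "markdown",
"metadata": {},
"source": "Since we'll be quoting raw accuracy later, it's worth a quick look at the class balance first. A minimal sketch using `bincount` (not run here, so no output):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# how many players fall into each league (class label)\ncounts = bincount(y)\nfor label in range(len(counts)):\n    if counts[label] > 0:\n        print 'class %d: %d players' % (label, counts[label])",
"language": "python",
"metadata": {},
"outputs": []
},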
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's remove the target and index from each record."
},
{
"cell_type": "code",
"collapsed": false,
"input": "X = X[:,2:]\nprint X.shape, X[0,:]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(3338L, 18L) [ 2.10000000e+01 8.00000000e+00 2.40000000e+02 4.69962000e+01\n 8.20114000e-04 1.68517000e-04 6.00000000e+00 0.00000000e+00\n 4.49000000e-05 1.98849600e-03 9.40227000e+01 9.05311000e+01\n 4.10170000e+00 1.50000000e+01 5.72960000e-04 5.00000000e+00\n 0.00000000e+00 0.00000000e+00]\n"
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Time to split into training and test."
},
{
"cell_type": "code",
"collapsed": false,
"input": "split = X.shape[0] * 0.8\ninds = permutation(range(X.shape[0]))\nX = X[inds,:]\ny = y[inds]\nXtrn, Xtst = X[:split,:], X[split:,:]\nytrn, ytst = y[:split], y[split:]\nprint Xtrn.shape, Xtst.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(2670L, 18L) (668L, 18L)\n"
}
],
"prompt_number": 33
},
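{
"cell_type": "markdown",
"metadata": {},
"source": "(Aside: `train_test_split` from `sklearn.cross_validation` does the shuffle-and-split in one call. A sketch of the equivalent; not run here.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.cross_validation import train_test_split\n\n# the same shuffled 80/20 split in one line\nXtrn2, Xtst2, ytrn2, ytst2 = train_test_split(X, y, test_size=0.2)\nprint Xtrn2.shape, Xtst2.shape",
"language": "python",
"metadata": {},
"outputs": []
},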
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's do some scaling by feature."
},
{
"cell_type": "code",
"collapsed": false,
"input": "mean, upper, lower = Xtrn.mean(axis=0), Xtrn.max(axis=0), Xtrn.min(axis=0)\nXtrn = (Xtrn-mean)/(upper-lower)\nXtst = (Xtst-mean)/(upper-lower)\nprint Xtrn.shape, Xtst.shape",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(2670L, 18L) (668L, 18L)\n"
}
],
"prompt_number": 41
},
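{
"cell_type": "markdown",
"metadata": {},
"source": "(Aside: `sklearn.preprocessing` has scalers that handle this fit-on-train, apply-to-both pattern. A sketch with `StandardScaler`, which standardizes to zero mean and unit variance rather than the range scaling above; not run here.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.preprocessing import StandardScaler\n\n# fit the scaling statistics on the training set only, then apply to both sets\nscaler = StandardScaler().fit(Xtrn)\nXtrn_std = scaler.transform(Xtrn)\nXtst_std = scaler.transform(Xtst)\nprint Xtrn_std.shape, Xtst_std.shape",
"language": "python",
"metadata": {},
"outputs": []
},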
{
"cell_type": "markdown",
"metadata": {},
"source": "Ok, let's throw some classifiers and regressors at it."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import sklearn\nfrom sklearn.svm import SVC\n\nsvm_model = SVC(kernel='rbf', C=0.5)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.390718562874\n"
}
],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Wow, pretty horrible."
},
{
"cell_type": "code",
"collapsed": false,
"input": "svm_model = SVC(kernel='rbf', C=10.)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.411676646707\n"
}
],
"prompt_number": 200
},
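{
"cell_type": "markdown",
"metadata": {},
"source": "(Aside: rather than hand-picking `C`, a small grid search over `C` and `gamma` would be more systematic. A sketch with `GridSearchCV`; the grid values are illustrative and this wasn't run.)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.grid_search import GridSearchCV\n\n# 3-fold cross-validated search over a small, illustrative grid\nparams = {'C': [0.1, 1., 10., 100.], 'gamma': [0.001, 0.01, 0.1]}\ngrid = GridSearchCV(SVC(kernel='rbf'), params, cv=3)\ngrid.fit(Xtrn, ytrn)\nprint grid.best_params_, grid.best_score_",
"language": "python",
"metadata": {},
"outputs": []
},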
{
"cell_type": "markdown",
"metadata": {},
"source": "Not a big improvement. Maybe if we tell it that we're doing regression..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.svm import SVR\nsvm_model = SVR(kernel='rbf', C=10.)\nsvm_model.fit(Xtrn,ytrn)\nprint svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.569161678222\n"
}
],
"prompt_number": 211
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That's the $R^2$ value, so it's not the same as classification. Let's see how well it classifies..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(svm_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.422155688623\n"
}
],
"prompt_number": 212
},
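{
"cell_type": "markdown",
"metadata": {},
"source": "We'll be repeating this round-then-score step for every regressor below, so here's a small helper that wraps it up (a sketch; `round_acc` is a hypothetical name, not used in the cells that follow):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def round_acc(model, X, y):\n    # round regression outputs to the nearest integer label and score them\n    yhat = around(model.predict(X), decimals=0)\n    return sum(equal(yhat, y)) / float(yhat.shape[0])\n\nprint round_acc(svm_model, Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": []
},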
{
"cell_type": "markdown",
"metadata": {},
"source": "Not a great improvement. Are we overfitting a lot?"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print 'in sample', svm_model.score(Xtrn,ytrn)\nprint 'out of sample', svm_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "in sample "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.546991652417\nout of sample "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.569161678222\n"
}
],
"prompt_number": 213
},
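{
"cell_type": "markdown",
"metadata": {},
"source": "The test $R^2$ actually comes out above the training $R^2$, so overfitting isn't the problem; a single 80/20 split is just noisy. Cross-validation, with the scaling refit inside each fold, would give a steadier estimate. A sketch using a `Pipeline` (not run here):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.cross_validation import cross_val_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler\n\n# scale inside each fold so no test-fold statistics leak into training\npipe = Pipeline([('scale', StandardScaler()), ('svr', SVR(kernel='rbf', C=10.))])\nscores = cross_val_score(pipe, X, y, cv=5)\nprint scores.mean(), scores.std()",
"language": "python",
"metadata": {},
"outputs": []
},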
{
"cell_type": "markdown",
"metadata": {},
"source": "Maybe this is just a tough problem. Let throw some tree models at it and then call it quits."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import GradientBoostingRegressor as GBR\ngb_model = GBR()\ngb_model.fit(Xtrn, ytrn)\nprint gb_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.632937458544\n"
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Hey there! Let's see what the classification accuracy is"
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(gb_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.417664670659\n"
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Well that's disappointing, I was expecting more. Let's try tuning the parameters a little."
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb_model = GBR(n_estimators=100, max_depth=3, min_samples_split=100, min_samples_leaf=100,\n subsample=0.5, learning_rate=0.1)\ngb_model.fit(Xtrn, ytrn)\nprint gb_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.645185998476\n"
}
],
"prompt_number": 106
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(gb_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.440119760479\n"
}
],
"prompt_number": 107
},
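{
"cell_type": "markdown",
"metadata": {},
"source": "Before moving on: gradient boosting gives us feature importances for free, which hints at what's driving the predictions. A sketch (the column indices refer to the trimmed 18-column matrix; not run here):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# the five columns the boosted trees lean on most\nimps = gb_model.feature_importances_\nfor i in argsort(imps)[::-1][:5]:\n    print 'column %d: importance %.3f' % (i, imps[i])",
"language": "python",
"metadata": {},
"outputs": []
},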
{
"cell_type": "markdown",
"metadata": {},
"source": "Not bad, it helped a little. How about a random forest"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import RandomForestRegressor as RFR\nrf_model = RFR(n_estimators=100, max_features='sqrt', max_depth=3,\n min_samples_split=10, min_samples_leaf=10, n_jobs=-1)\n\nrf_model.fit(Xtrn, ytrn)\nprint rf_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.554587772605\n"
}
],
"prompt_number": 108
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And I'll do a little parameter tuning."
},
{
"cell_type": "code",
"collapsed": false,
"input": "rf_model = RFR(n_estimators=200, max_features='auto', max_depth=15,\n min_samples_split=1, min_samples_leaf=1, n_jobs=-1)\n\nrf_model.fit(Xtrn, ytrn)\nprint rf_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.623351946439\n"
}
],
"prompt_number": 132
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(rf_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.422155688623\n"
}
],
"prompt_number": 133
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not able to be the GBM. Let's try with extremely randomized trees"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.ensemble import ExtraTreesRegressor as XR\nxr_model = XR(n_estimators=10, max_depth=5, min_samples_split=2,\n min_samples_leaf=2, bootstrap=False, n_jobs=-1)\n\nxr_model.fit(Xtrn, ytrn)\nprint xr_model.score(Xtst, ytst)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.559458645933\n"
}
],
"prompt_number": 135
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And a little parameter tuning"
},
{
"cell_type": "code",
"collapsed": false,
"input": "xr_model = XR(n_estimators=100, max_depth=20, min_samples_split=1,\n min_samples_leaf=10, bootstrap=False, n_jobs=-1)\n\nxr_model.fit(Xtrn, ytrn)\nprint xr_model.score(Xtst, ytst), xr_model.score(Xtrn, ytrn)\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.613048652136 "
},
{
"output_type": "stream",
"stream": "stdout",
"text": "0.714282553314\n"
}
],
"prompt_number": 154
},
{
"cell_type": "code",
"collapsed": false,
"input": "yhat = around(xr_model.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.417664670659\n"
}
],
"prompt_number": 155
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Not so great. One last shot - let's try KNN"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from sklearn.neighbors import KNeighborsClassifier as KNC, KNeighborsRegressor as KNR\n\n#knn classifier\nknc = KNC(n_neighbors=101, weights='distance')\nknc.fit(Xtrn, ytrn)\nprint knc.score(Xtst, ytst)\n\n#knn regressor\nknr = KNR(n_neighbors=101, weights='distance')\nknr.fit(Xtrn, ytrn)\nyhat = around(knr.predict(Xtst), decimals=0)\nprint sum(equal(yhat,ytst)) / float(yhat.shape[0])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "0.40119760479\n0.384730538922"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 182
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It's interesting that the classifier does better than the regressor here.\n\n## Summary ##\n\nGradient boosting methods performed the best, followed by support vector regression and random forests. This is a fairly hard prediction problem, the best raw accuracy metric (which should't be used on this problem, but sue me) was < 50%."
},
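{
"cell_type": "markdown",
"metadata": {},
"source": "For completeness, here's one way to rerun the regressor head-to-head in a single loop, scoring both $R^2$ and rounded accuracy (a sketch reusing the imports and tuned parameters from above; not run here):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "models = [('SVR', SVR(kernel='rbf', C=10.)),\n          ('GBR', GBR(n_estimators=100, max_depth=3, min_samples_split=100,\n                      min_samples_leaf=100, subsample=0.5, learning_rate=0.1)),\n          ('RFR', RFR(n_estimators=200, max_depth=15, n_jobs=-1)),\n          ('XR', XR(n_estimators=100, max_depth=20, min_samples_leaf=10, n_jobs=-1))]\nfor name, model in models:\n    model.fit(Xtrn, ytrn)\n    # R^2 on the held-out set, plus accuracy after rounding to the nearest label\n    yhat = around(model.predict(Xtst), decimals=0)\n    acc = sum(equal(yhat, ytst)) / float(yhat.shape[0])\n    print '%s: R^2 = %.3f, acc = %.3f' % (name, model.score(Xtst, ytst), acc)",
"language": "python",
"metadata": {},
"outputs": []
}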
],
"metadata": {}
}
]
}