Created
March 15, 2018 16:16
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Kaggle March Madness Challenge 2018" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"Google Cloud and NCAA® have teamed up to bring you this year’s version of the Kaggle machine learning competition. Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness® during this year's NCAA Division I Men’s and Women’s Basketball Championships. But unlike most fans, you will pick your bracket using a combination of NCAA’s historical data and your computing power, while the ground truth unfolds on national television." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- Challenge Home: https://www.kaggle.com/c/mens-machine-learning-competition-2018\n", | |
"\n", | |
"\n", | |
"\n", | |
"- Basic Logistic Regression Starter Kernel: https://www.kaggle.com/osciiart/basic-starter-kernel-ncaa-men-s-dataset-with-jp\n", | |
"\n", | |
"\n", | |
"- Least Squares Starter Kernel: https://www.kaggle.com/baeng72/basic-least-squares-ratings\n", | |
"\n", | |
"\n", | |
"- NCAA Tournaments Competition Walkthrough: https://www.kaggle.com/asindico/ncaa-tournaments-competition-walkthrough\n", | |
"\n", | |
"\n", | |
"- Basic Starter Kernel: https://www.kaggle.com/juliaelliott/basic-starter-kernel-ncaa-men-s-dataset\n", | |
"\n", | |
"\n", | |
"- Feature Engineering with Advanced Statistics: https://www.kaggle.com/lnatml/feature-engineering-with-advanced-stats\n", | |
"\n", | |
"\n", | |
"- FiveThirtyEight Elo Ratings: https://www.kaggle.com/lpkirwin/fivethirtyeight-elo-ratings\n", | |
"\n", | |
"\n", | |
"- Extensive NCAA Exploratory Analysis: https://www.kaggle.com/captcalculator/a-very-extensive-ncaa-exploratory-analysis\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Preparation\n", | |
"Import the required packages and load the initial datasets." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"import pickle\n", | |
"\n", | |
"from sklearn.linear_model import LogisticRegression\n", | |
"from sklearn.utils import shuffle\n", | |
"from sklearn.model_selection import GridSearchCV\n", | |
"from sklearn.model_selection import train_test_split\n", | |
"from sklearn.metrics import log_loss\n", | |
"from sklearn import preprocessing\n", | |
"from sklearn import model_selection \n", | |
"from sklearn.metrics import confusion_matrix\n", | |
"from sklearn.metrics import classification_report\n", | |
"\n", | |
"import matplotlib.pyplot as plt\n", | |
"%matplotlib inline\n", | |
"\n", | |
"import warnings\n", | |
"warnings.filterwarnings('ignore')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_seeds = pd.read_csv(data_dir + 'NCAATourneySeeds.csv')\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>Seed</th>\n", | |
" <th>TeamID</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>W01</td>\n", | |
" <td>1207</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>W02</td>\n", | |
" <td>1210</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>W03</td>\n", | |
" <td>1228</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>W04</td>\n", | |
" <td>1260</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>W05</td>\n", | |
" <td>1374</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season Seed TeamID\n", | |
"0 1985 W01 1207\n", | |
"1 1985 W02 1210\n", | |
"2 1985 W03 1228\n", | |
"3 1985 W04 1260\n", | |
"4 1985 W05 1374" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_seeds.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>WScore</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>LScore</th>\n", | |
" <th>WLoc</th>\n", | |
" <th>NumOT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1116</td>\n", | |
" <td>63</td>\n", | |
" <td>1234</td>\n", | |
" <td>54</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1120</td>\n", | |
" <td>59</td>\n", | |
" <td>1345</td>\n", | |
" <td>58</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1207</td>\n", | |
" <td>68</td>\n", | |
" <td>1250</td>\n", | |
" <td>43</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1229</td>\n", | |
" <td>58</td>\n", | |
" <td>1425</td>\n", | |
" <td>55</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1242</td>\n", | |
" <td>49</td>\n", | |
" <td>1325</td>\n", | |
" <td>38</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT\n", | |
"0 1985 136 1116 63 1234 54 N 0\n", | |
"1 1985 136 1120 59 1345 58 N 0\n", | |
"2 1985 136 1207 68 1250 43 N 0\n", | |
"3 1985 136 1229 58 1425 55 N 0\n", | |
"4 1985 136 1242 49 1325 38 N 0" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_tour.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Seed Based Logistic Regression\n", | |
"Use only the seed difference between the two teams to predict the winner and a win probability. This serves as the baseline model." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>Seed</th>\n", | |
" <th>TeamID</th>\n", | |
" <th>Seed_int</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>W01</td>\n", | |
" <td>1207</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>W02</td>\n", | |
" <td>1210</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>W03</td>\n", | |
" <td>1228</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>W04</td>\n", | |
" <td>1260</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>W05</td>\n", | |
" <td>1374</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season Seed TeamID Seed_int\n", | |
"0 1985 W01 1207 1\n", | |
"1 1985 W02 1210 2\n", | |
"2 1985 W03 1228 3\n", | |
"3 1985 W04 1260 4\n", | |
"4 1985 W05 1374 5" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Convert seed to int: 'W01' is a region letter followed by a two-digit seed,\n", | |
"# so characters 1-2 hold the number (any trailing character is dropped by the slice)\n", | |
"df_seeds['Seed_int'] = pd.to_numeric(df_seeds['Seed'].str[1:3])\n", | |
"df_seeds.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Drop unnecessary columns\n", | |
"df_seeds.drop(labels=['Seed'], inplace=True, axis=1)\n", | |
"df_tour.drop(labels=['DayNum', 'WScore', 'LScore', 'WLoc', 'NumOT'], inplace=True, axis=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>TeamID</th>\n", | |
" <th>Seed_int</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>1207</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>1210</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>1228</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>1260</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>1374</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season TeamID Seed_int\n", | |
"0 1985 1207 1\n", | |
"1 1985 1210 2\n", | |
"2 1985 1228 3\n", | |
"3 1985 1260 4\n", | |
"4 1985 1374 5" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_seeds.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>1116</td>\n", | |
" <td>1234</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>1120</td>\n", | |
" <td>1345</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>1207</td>\n", | |
" <td>1250</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>1229</td>\n", | |
" <td>1425</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>1242</td>\n", | |
" <td>1325</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season WTeamID LTeamID\n", | |
"0 1985 1116 1234\n", | |
"1 1985 1120 1345\n", | |
"2 1985 1207 1250\n", | |
"3 1985 1229 1425\n", | |
"4 1985 1242 1325" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_tour.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>WSeed</th>\n", | |
" <th>LSeed</th>\n", | |
" <th>SeedDiff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>1116</td>\n", | |
" <td>1234</td>\n", | |
" <td>9</td>\n", | |
" <td>8</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>1120</td>\n", | |
" <td>1345</td>\n", | |
" <td>11</td>\n", | |
" <td>6</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>1207</td>\n", | |
" <td>1250</td>\n", | |
" <td>1</td>\n", | |
" <td>16</td>\n", | |
" <td>-15</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>1229</td>\n", | |
" <td>1425</td>\n", | |
" <td>9</td>\n", | |
" <td>8</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>1242</td>\n", | |
" <td>1325</td>\n", | |
" <td>3</td>\n", | |
" <td>14</td>\n", | |
" <td>-11</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season WTeamID LTeamID WSeed LSeed SeedDiff\n", | |
"0 1985 1116 1234 9 8 1\n", | |
"1 1985 1120 1345 11 6 5\n", | |
"2 1985 1207 1250 1 16 -15\n", | |
"3 1985 1229 1425 9 8 1\n", | |
"4 1985 1242 1325 3 14 -11" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Merge dataframes: attach the winner's and the loser's seed to every tournament game\n", | |
"df_winseeds = df_seeds.rename(columns={'TeamID':'WTeamID', 'Seed_int':'WSeed'})\n", | |
"df_lossseeds = df_seeds.rename(columns={'TeamID':'LTeamID', 'Seed_int':'LSeed'})\n", | |
"df_dummy = pd.merge(left=df_tour, right=df_winseeds, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_lossseeds, on=['Season', 'LTeamID'])\n", | |
"# SeedDiff < 0 means the winner had the better (lower-numbered) seed\n", | |
"df_concat['SeedDiff'] = df_concat.WSeed - df_concat.LSeed\n", | |
"df_concat.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>SeedDiff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>5</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>-15</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>-11</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" SeedDiff Result\n", | |
"0 1 1\n", | |
"1 5 1\n", | |
"2 -15 1\n", | |
"3 1 1\n", | |
"4 -11 1" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
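], | |
"source": [ | |
"# Create training data set: each game appears twice, once from the winner's side\n", | |
"# (Result = 1) and once from the loser's side with SeedDiff negated (Result = 0)\n", | |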
"df_wins = pd.DataFrame()\n", | |
"df_wins['SeedDiff'] = df_concat['SeedDiff']\n", | |
"df_wins['Result'] = 1\n", | |
"\n", | |
"df_losses = pd.DataFrame()\n", | |
"df_losses['SeedDiff'] = -df_concat['SeedDiff']\n", | |
"df_losses['Result'] = 0\n", | |
"\n", | |
"df_predictions = pd.concat((df_wins, df_losses))\n", | |
"df_predictions.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": true | |
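}, | |
"outputs": [], | |
"source": [ | |
"# scikit-learn expects a 2-D feature matrix, so reshape the single feature into one column\n", | |
"X_train = df_predictions['SeedDiff'].values.reshape(-1,1)\n", | |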
"Y_train = df_predictions['Result'].values\n", | |
"X_train, Y_train = shuffle(X_train, Y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best log_loss: -0.5532, with best C: 0.021544346900318846\n" | |
] | |
} | |
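], | |
"source": [ | |
"# Create and test model: grid-search the regularization strength C.\n", | |
"# scoring='neg_log_loss' reports the negated log loss, so best_score_ is negative\n", | |
"# and values closer to zero are better.\n", | |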
"logreg = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf.fit(X_train, Y_train)\n", | |
"print('Best log_loss: {:.4}, with best C: {}'.format(clf.best_score_, clf.best_params_['C']))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Type</th>\n", | |
" <th>Log Loss</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Seed Based Logistic Regression</td>\n", | |
" <td>-0.55315</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Type Log Loss\n", | |
"0 Seed Based Logistic Regression -0.55315" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Store model results (the value is the negated log loss from the grid search, hence the minus sign)\n", | |
"df_results = pd.DataFrame({'Type': ['Seed Based Logistic Regression'], 'Log Loss': [clf.best_score_]}, columns=['Type', 'Log Loss'])\n", | |
"df_results.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEKCAYAAAD9xUlFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAIABJREFUeJzt3Xd8VFX6x/HPkwQINfSOBBBEUGqQ\n3qyACqiogBULIIKC6+7q7v7UVbfougoIFuwVFBcUUVGRIl1C7xCKEGlBqdIEzu+Pe4ljTBlIZibl\n+3695pVbztz7zJ3JPHPPPfccc84hIiICEBXpAEREJPdQUhARkVRKCiIikkpJQUREUikpiIhIKiUF\nERFJpaQgIiKplBRERCSVkoKIiKSKiXQAZ6p8+fIuPj4+0mGIiOQpixYt2uOcq5BVuTyXFOLj40lM\nTIx0GCIieYqZfR9MOVUfiYhIKiUFERFJpaQgIiKplBRERCSVkoKIiKRSUhARkVRKCiIikqrAJIWV\nP+xn5Dcb+PHQsUiHIiKSaxWYpDA7aQ/Pfr2e1v+exp8/Ws7anQciHZKISK6T5+5oPlsDO9bh0vMr\n8vqcLUxYnMwHidtod2557mgXT6d6FYmKskiHKCISceaci3QMZyQhIcFlt5uLvT8f5/3vtvL2vC3s\nOnCM2uWL069tPNc1r06xwgUmT4pIAWJmi5xzCVmWK4hJ4bRfTp7i8xU7eG32ZpYn76dUbAx9Wp7D\nba3jqVq6aI7sQ0QkN1BSOAPOORZ9v5fX52xmysqdmBldL6jMne1q0fScMjm6LxGRSAg2KaiuBDAz\nEuLLkhBflm0/HeatuVv4YOE2Ji/fQdNzSnNnu1p0aViZmOgCc11eRAoonSlk4NCxE3yUuI035m7h\n+x8PUzUultvbxtO3ZU1KFFEuFZG8RdVHOeTkKce0tbt5bfYm5m/6idLFCnFH21rc1iaeuKKFwhaH\niEh2KCmEwJKtexk1LYlv1u6mZGwM/drEc0e7WpQuVjgi8YiIBCvYpBDSSnIz62Jm68wsycweSmf9\nOWY23cyWmNlyM+sWyniyq+k5ZXjt9hZMHtKONnXKMXJaEm3/PY2npqzVndIiki+E7EzBzKKB9cBl\nQDKwEOjjnFsdUGYMsMQ596KZNQA+d87FZ7bdSJ4ppLV25wFGTUvisxU7iI2J5qaW59C/Q20qloqN\ndGgiIr+RG84ULgKSnHObnHPHgXFAjzRlHFDKn44DtocwnhxXv3IpRvVtxtfDOtL1gsq8MXcL7Z6e\nzqOfrGTH/iORDk9E5IyFMilUA7YFzCf7ywI9BtxsZsnA58CQEMYTMudWLMGzNzZh2h86ck2Tary3\nYCsdn57BXyauYNtPhyMdnohI0EKZFNLrTChtXVUf4E3nXHWgG/COmf0uJjPrb2aJZpaYkpISglBz\nRs1yxXmqVyNm/LETN7SozkeJyXR+ZgZ/HL+MLXt+jnR4IiJZCuU1hdbAY865K/z5hwGcc/8KKLMK\n6OKc2+bPbwJaOed2Z7Td3HRNISs79x/lpZkbGfvdVn45eYoeTaox7NJ6nFOuWKRDE5ECJjdcU1gI\n1DWzWmZWGOgNTEpTZitwCYCZnQ/EArn3VOAMVY6L5bHuDZn1587c1b42X6zcwSXPzuCxSavUWklE\ncqWQ3qfgNzEdDkQDrzvn/mFmjwOJzrlJfoujV4ASeFVLf3LOfZXZNvPSmUJauw4cZfjUDXyYuI3Y\nmCgGdKzDne1qUVx3SItIiOnmtVwsafchnvlyHVNW7aR8iSLcf2ldereoQSH1rSQiIZIbqo8kA+dW\nLMFLtzRnwqA21C5fnP/7eCWXPTuTz5bvIK8laRHJX5QUIqjZOWX4YEArXr89gSIx0dz7/mJ6jp7D\n3I17Ih2aiBRQSgoRZmZcXL8Sn9/fnmeub0zK
wWP0fWUBt73+Hau3axxpEQkvXVPIZY7+cpK3521h\n9PSNHDj6Cz2bVOOBy+pRo6yasYrI2dOF5jxu/+FfeHHmRt6Ysxnn4OZWNRl88bmULa4eWUXkzCkp\n5BM79h9h+NcbGL9oG8ULx3D/pXW5rU28WiqJyBlR66N8okpcUZ7q1YgpQzvQrGYZnvxsDd1GzGJO\nki5Gi0jOU1LII+pVKsmb/Vrwyq0JHD1xkpteXcDAdxaRvFcd7olIzlFSyEPMjMsaVOLrYR35w2X1\nmLF+N5f8dybDp67n6C8nIx2eiOQDSgp5UGyhaIZcUpdv/tCJSxtUYvjUDVz67EymrNypm99EJFuU\nFPKwaqWLMrpvM96/uyXFC8cw8N1F3PLadyTtPhjp0EQkj1JSyAfa1CnPZ/e149GrG7AseR9dhs/i\nycmrOXj0l0iHJiJ5jJJCPhETHUW/trWY/mAnejWvzmtzNtP5mZl8tCiZU6dUpSQiwVFSyGfKlyjC\nv69rxMeD2lK9TFEeHL+M616ay/LkfZEOTUTyACWFfKpxjdJMuKcN/+nViG0/HabH6Dk89L/l7Dt8\nPNKhiUgupqSQj0VFGdcn1GDag524s20txi9K5tJnZzJp2Xa1UhKRdCkpFAClYgvxt6saMGlwW6qW\nLsp9Y5dwx5sL+WHfkUiHJiK5jJJCAdKwahwTB7Xlb1eez/xNP3HZszN5ffZmTupCtIj4lBQKmOgo\n4672tflqWAdaxJfl8cmrufbFuazZobEbRERJocCqUbYYb/ZrwYjeTUj+6TBXPz+bp6asVXcZIgWc\nkkIBZmb0aFKNqQ90pGfTarw4YyNdhn/LXPXAKlJghTQpmFkXM1tnZklm9lA6658zs6X+Y72ZqTF9\nBJQpXphnrm/Me3e1xAF9X13AH8cvU/NVkQIoZIPsmFk0sB64DEgGFgJ9nHOrMyg/BGjqnLsjs+0W\ntEF2wu3I8ZOM+GYDr8zaRJlihXjk6oZc3agKZhbp0EQkG3LDIDsXAUnOuU3OuePAOKBHJuX7AGND\nGI8EoWjhaB7qWv83zVfvfCtRzVdFCohQJoVqwLaA+WR/2e+YWU2gFjAthPHIGQhsvjpv449qvipS\nQIQyKaRX35DRN0pv4CPnXLpNX8ysv5klmlliSkpKjgUomUuv+WqfV+az9UeN9iaSX4UyKSQDNQLm\nqwPbMyjbm0yqjpxzY5xzCc65hAoVKuRgiBKM081Xn7m+MWu2H6DLiG95b8H36ipDJB8KZVJYCNQ1\ns1pmVhjvi39S2kJmdh5QBpgXwlgkm8yMXs2r8+WwDjQ7pwx/nbiSW1//jh37da1BJD8JWVJwzp0A\nBgNfAmuAD51zq8zscTPrHlC0DzDO6WdnnlC1dFHevuMinujRkMQte7n8uW+ZsDhZZw0i+UTImqSG\nipqk5h5b9vzMg+OXkfj9Xq5oWIl/XHMh5UsUiXRYIpKO3NAkVfK5+PLF+WBAa/7SrT7T16Zw+XPf\nMmXljkiHJSLZoKQg2RIdZfTvUIfJ97WjaulYBr67mKHjlrD/sMaHFsmLlBQkR9SrVJKJg9oy9NK6\nTF6+g8uHz2TGut2RDktEzpCSguSYQtFRDL20HhMHtSWuaCFuf2MhD09YwaFjJyIdmogESUlBctyF\n1eOYNLgdAzrWZtzCrXQZ/i3zN/0Y6bBEJAhKChISsYWiebjr+Ywf0JroKKP3mPk8/ulqjdcgkssp\nKUhIJcSX5Yv723NLq5q8PmczPUfPYd3Og5EOS0QyoKQgIVescAxP9LyAN/q1YM+hY3QfNZt35m3R\nDW8iuZCSgoRN5/Mq8sX9HWhdpxz/98kq7n47kR8PHYt0WCISQElBwqpCySK8cXsLHr26Ad+u30OX\nEbOYtUE934rkFkoKEnZmRr+2tfj43raULlqIW177jn9+vobjJ05FOjSRAk9JQSKmQdVSTBrcjptb\nncOYbzdx
7Ytz2JhyKNJhiRRoSgoSUUULR/NkzwsZc0tzfth7hKtGzmbcd1t1EVokQpQUJFe4vGFl\npgztQLOapXlowgoGvbeYfYePRzoskQJHSUFyjUqlYnnnjpY83LU+X6/eRdcRs3QntEiYKSlIrhIV\nZQzoWIeJg9oSWyiaPq/M55kv1/HLSV2EFgkHJQXJlS6sHsfkIe24oXkNRk1P4vqX5vH9jz9HOiyR\nfE9JQXKt4kVieKpXI0b3bcamlENcOXI2k5Ztj3RYIvmakoLkelc2qsIXQztQv3JJ7hu7hL9MXKGO\n9URCRElB8oRqpYsytn8r7ulUh/cXbKXnaN3TIBIKSgqSZxSKjuLPXerzRr8W7DpwlO7Pz+aTpT9E\nOiyRfCWkScHMupjZOjNLMrOHMihzg5mtNrNVZvZ+KOOR/KHzeRX5/P72NKhaivvHLeXhCctVnSSS\nQ0KWFMwsGhgNdAUaAH3MrEGaMnWBh4G2zrmGwNBQxSP5S5W4ooy9uxWDOtVh7Hfb6Dl6Dkm7VZ0k\nkl2hPFO4CEhyzm1yzh0HxgE90pS5GxjtnNsL4JzTSO8StJjoKP7UpT5v9mvB7oPeOA0fL1F1kkh2\nhDIpVAO2Bcwn+8sC1QPqmdkcM5tvZl1CGI/kU53Oq8jn97XngqpxDP1gKX/+aDlHjqs6SeRshDIp\nWDrL0vZyFgPUBToBfYBXzaz07zZk1t/MEs0sMSVFfe/L71WOi+X9u1syuPO5fLjodHWShv0UOVOh\nTArJQI2A+epA2juPkoFPnHO/OOc2A+vwksRvOOfGOOcSnHMJFSpUCFnAkrfFREfx4BXn8Va/i9hz\n6BhXPz+H/y1KjnRYInlKUEnBPBea2RVm1sHMygXxtIVAXTOrZWaFgd7ApDRlPgY6+/soj1edtCn4\n8EV+r0O9Cnx+f3saVY/jD+OX8cfxy1SdJBKkmMxWmlk88CegC7AZSAFi8b7s9wEvAe+6dDq/d86d\nMLPBwJdANPC6c26VmT0OJDrnJvnrLjez1cBJ4I/OOXWLKdlWqVQs793VkhHfbGDU9CSWJe9jdN9m\n1K1UMtKhieRqltlgJmb2IfAiMNM5dyrNuirATcAe59yboQwyUEJCgktMTAzX7iQfmLUhhaHjlnL4\n+En+ee0FXNO0eqRDEgk7M1vknEvIslxeG+FKSUHOxu4DRxk8dgnfbf6Jm1qewyNXN6BITHSkwxIJ\nm2CTwllfaDazzmf7XJFwq1gqlvfvasmADrV5b8FWrn9pHtt+OhzpsERyney0Pnorx6IQCYOY6Cge\n7nY+L9/SnM17fuaq52czfa3ulxQJlNWF5gkZrQKCaYEkkutc0bAy9SuXZOC7i+n35kIGdz6XYZfV\nIzoqvVtrRAqWTJMCXnPR24C0Q14Z0CYkEYmEQc1yxZk4qA2PfLKSUdOTWLJtLyN6N6V8iSKRDk0k\norJKCguAg8656WlXmNnG0IQkEh6xhaJ5uldjEmqW5f8+WclVI2cz+qamNK9ZNtKhiURMVtcUuqaX\nEACcczpTkHzhhhY1mDCoDYVjorjx5fm8Nnszea1VnkhOyTQppHdTmkh+1LBqHJ8OaUfn+hV5YvJq\nBr+/hINHf4l0WCJhp5HXRHxxRQsx5pbmPNS1PlNW7aTHqDms26lO9aRgUVIQCWBmDOxYh/fuasnB\nYyfoOXoOE5eoUz0pOJQURNLRqnY5PhvSjgurxzHsg2X8deIKjp1Qp3qS/wXbS2oXM1toZrvN7Ccz\n22tmP4U6OJFI0l3QUhAFe6YwChiAN3JaBaC8/1ckX/vNXdApP3P1qNnMXK+BniT/CjYpJANL/cFw\nTp5+hDIwkdzkioaVmTSkHZVLxXL7G9/x/DcbOHVKjfMk/8nq5rXT/gR8amYzgGOnFzrnRoYiKJHc\nqFb54kwY1Ia/TFjBf79ez9Jt+3j2xibEFS0U6dBEckywZwp/xxsEpzRetd
Hph0iBUqxwDM/d2ITH\nezRk5voUuo+azertByIdlkiOCfZMoaJzrnlIIxHJI8yMW1vH07BqKQa9t5hrX5zDP6+5kGubafAe\nyfuCPVP4xswuDmkkInlM85plmTykPY2rl+aBD5fxfx+v5PiJU1k/USQXCzYp3A1MNbNDapIq8qsK\nJYvw3l0t6d+hNu/M/54bXp7Hjv1HIh2WyFkLNimUBwoBcahJqshvxERH8Zdu5/PCTc3YsOsgV42c\nzdykPZEOS+SsBJUU/OanJYDGQMuAh4j4ul1YhU8Gt6NM8cLc/NoCXpq5Ub2tSp4T7B3NdwJzgWnA\nU/7ffwbxvC5mts7MkszsoXTW325mKWa21H/cdYbxi+Qq51Yswcf3tqXrBVX49xdrGfjuIvW2KnlK\nsNVHQ4EEYItzrj3QHNiR2RPMLBoYDXQFGgB9zKxBOkU/cM418R+vBh+6SO5UokgMo/o25W9Xns/U\nNbvpMWoO63ept1XJG4JNCkedc0cAzKywc24VUD+L51wEJDnnNjnnjgPjgB5nH6pI3mFm3NW+Nu/f\n1ZIDR0/QY9QcJi3bHumwRLIUbFLYYWalgU+BL83sf8CuLJ5TDdgWMJ/sL0vrOjNbbmYfmVmNIOMR\nyRNa1i7HZ/e1o2HVUtw3dgl//3QVv5xUs1XJvYK90NzdObfPOfd/wJPAe2T9q9/S21Sa+U+BeOdc\nI2Aq8Fa6GzLrb2aJZpaYkqLOyCRvqVQqlrH9W3F7m3jemLOFvq/MZ/eBo5EOSyRdQY+nYGatzOxW\n59w3wEygUhZPSQYCf/lXB35z/uyc+9E5d7ovpVfwrlX8jnNujHMuwTmXUKGCWsJK3lMoOorHujdk\nRO8mrPzhAFc+P5vvNutWH8l9gm199DfgUeBv/qJY4P0snrYQqGtmtcysMNAbmJRmu1UCZrsDa4KJ\nRySv6tGkGh/f25YSRWLo88p8Xp21Sc1WJVcJ9kyhF9AN+BnAOfcDUCqzJzjnTgCDgS/xvuw/dM6t\nMrPHzay7X+w+M1tlZsuA+4Dbz/wliOQt51UuySeD23JJ/Yo8+dkaBo9dws/HTkQ6LBEALJhfKWa2\nwDnX0swWO+eamVkxYL5/LSCsEhISXGJiYrh3K5LjnHO8NHMT//lyLbUrlOClm5tzbsUSkQ5L8ikz\nW+ScS8iqXLBnChPMbDQQZ2b9gK+A17MToEhBZ2bc06kO79zZkp9+Pk6PUbP5YkWmt/+IhFywrY+e\nAibjXRNoDPzDOTc8lIGJFBRtzy3P5CHtOLdSSe55bzH/+nwNJ9RsVSIk0+ojM/vKOXd5GOPJkqqP\nJL86duIkT0xezbvzt9Kqdlme79OMCiWLRDosySdyqvpI7T9FwqRITDRP9ryQZ65vzJKt+7jq+Vks\n+n5vpMOSAiarkdfizOzajFY65ybkcDwiBV6v5tU5v0pJ7nl3Mb3HzONvVzbg1tY1MUvvflCRnJVl\nUgCuIuO7k5UUREKgYdU4Ph3cjmEfLuXRSatYsnUv/7z2QooVDnYEXZGzk9Un7Hvn3B1hiUREfiOu\nWCFevTWB0dOTeHbqetbuPMiLNzenVvnikQ5N8rGsrinofFUkgqKijCGX1OXNfhex88BRuj8/mykr\n1WxVQierpHBrVhswVXSKhFzHehWYPKQdtSsUZ+C7i/nHZ6vV26qERFZJ4XkzG2Jm5wQuNLPCZnax\nmb0F3Ba68ETktOplivHhwNbc0qomr8zaTN9X5rNLva1KDssqKXQBTgJjzWy7ma02s03ABqAP8Jxz\n7s0QxygiviIx0TzR84Jfe1sdOYt5G3+MdFiSjwTV9xGAmRUCygNHnHP7QhpVJnTzmohn/a6DDHx3\nEVv2/MyDV5zHwA51iIpSba6kL0duXjOzWDMbamajgH5ASiQTgoj8ql6lkkwa3I6uF1bh6Snr6P9O\nIvsP/xLpsCSPy6r66C0gAViB13X2f0
MekYgErUSRGEb1acqjVzdgxroUrho1i5U/7I90WJKHZZUU\nGjjnbnbOvYw3pkL7MMQkImfAzOjXthYfDGjNiZOOa1+cy7jvtmrwHjkrWSWF1HNRf9AcEcmlmtcs\nw+Qh7WhZqywPTVjBHz9azpHjJyMdluQxWSWFxmZ2wH8cBBqdnjazA+EIUESCV65EEd7sdxH3XVKX\n/y1O5poX5rB5z8+RDkvykEyTgnMu2jlXyn+UdM7FBExnOhyniERGdJTxwGX1eOP2FgF3Qe+MdFiS\nRwQ78pqI5DGdzqsYcBf0Iv75+RrdBS1ZUlIQyccC74Ie8+0meo+Zz/Z9RyIdluRiSgoi+dzpu6BH\n9mnK2h0H6DZyFt+s2RXpsCSXCmlSMLMuZrbOzJLM7KFMyvUyM2dmWd5tJyJnp3vjqky+rz1V4opy\n51uJ/EvVSZKOkCUFM4sGRgNdgQZAHzNrkE65ksB9wIJQxSIinlrlizNxUBtubnUOL3+7iRtfnscP\nqk6SAKE8U7gISHLObXLOHQfGAT3SKfcE8DSg7h5FwiC2kDcW9Ki+TVm/6xDdRsxi6mpVJ4knlEmh\nGrAtYD7ZX5bKzJoCNZxzkzPbkJn1N7NEM0tMSUnJ+UhFCqCrGlVl8pB2VC9TlLveTtQYDQKENilk\nNK6zt9IsCngO+ENWG3LOjXHOJTjnEipUqJCDIYoUbPHli/O/e9qkjtFww8vzSN57ONJhSQSFMikk\nAzUC5qsD2wPmSwIXADPMbAvQCpiki80i4RVbyGudNLpvMzbsOsSVI2fztaqTCqxQJoWFQF0zq2Vm\nhYHewKTTK51z+51z5Z1z8c65eGA+0N05p8ESRCLgykZVmDykHTXKFuXutxN5cvJqjp9QdVJBE7Kk\n4HegNxj4ElgDfOicW2Vmj5tZ91DtV0TO3unqpNta1+TV2Zu5/uV5bPtJ1UkFSdAjr+UWGnlNJDw+\nX7GDP3+0HDN45vrGXN6wcqRDkmzIkZHXRKTg6nZhFSbf146a5YrT/51F/P3TVRw7oa648zslBRHJ\nUM1yxfnontbc3iaeN+Zs4doX5rIx5VCkw5IQUlIQkUwViYnmse4NGXNLc7bvO8JVI2drZLd8TElB\nRIJyecPKTBnagWY1S/PQhBUMem8x+w4fj3RYksOUFEQkaJVKxfLOHS15uGt9vl69i64jZjFv44+R\nDktykJKCiJyRqChjQMc6TBjUhthC0fR9dT7/+XKtusjIJ5QUROSsNKpemslD2nF98+qMnr6RXi/N\n4/sfNR50XqekICJnrXiRGJ7u1ZjRfZuxOcXrcXXC4mRdhM7DlBREJNuubFSFL4Z2oGHVOB74cBn3\nj1vKgaO/RDosOQtKCiKSI6qVLsrY/q34w2X1+GzFDrqNmMWi73+KdFhyhpQURCTHREcZQy6py/iB\nrTGDG16ez4ipGzihi9B5hpKCiOS4ZueU4fP72tO9cVWem7qe3mPma5yGPEJJQURComRsIZ67sQnP\n3diYtTsP0nXELCYu0UXo3E5JQURC6pqm1fn8vvbUq1SSYR8s4553F7Pn0LFIhyUZUFIQkZA7p1wx\nPhzQmoe71mfa2t1c8dy3TFm5I9JhSTqUFEQkLKL9O6En39eOKqVjGfjuYoZ9sJT9h9V0NTdRUhCR\nsKpXqSQTB7Xl/kvqMmnZdi4fPpMZ63ZHOizxKSmISNgVio5i2GX1+HhQW0rFFuL2Nxbyl4krOHTs\nRKRDK/CUFEQkYi6sHsenQ9oxoENtxn63la4jvmXBJvW6GklKCiISUbGFonm42/mMH9CaKDN6vzKf\nJyav5ugvGvozEpQURCRXSIgvyxf3t+fmljV5bfZmuo2cxdJt+yIdVoET0qRgZl3MbJ2ZJZnZQ+ms\nH2hmK8xsqZnNNrMGoYxHRHK3YoVjeKLnBbxz50UcOX6S616cy3+/WsfxE+omI1xClhTMLBoYDXQF\nGg
B90vnSf985d6FzrgnwNPBsqOIRkbyjfd0KTBnagWuaVuP5aUn0GD2HNTsORDqsAiGUZwoXAUnO\nuU3OuePAOKBHYAHnXOC7XBzQ/e8iAkBc0UI8c31jXrk1gZSDx+g+ajbDp67n2AldawilUCaFasC2\ngPlkf9lvmNm9ZrYR70zhvvQ2ZGb9zSzRzBJTUlJCEqyI5E6XNajEV8M60PWCKgyfuoErR84mcYu6\n5A6VUCYFS2fZ784EnHOjnXN1gD8Df0tvQ865Mc65BOdcQoUKFXI4TBHJ7coWL8zIPk154/YWHDl+\nkl4vzeNvH6/QQD4hEMqkkAzUCJivDmzPpPw4oGcI4xGRPK5z/Yp8NawDd7StxfsLtnLZszP5ctXO\nSIeVr4QyKSwE6ppZLTMrDPQGJgUWMLO6AbNXAhtCGI+I5APFi8TwyNUNmDioLWWKFWbAO4sY8E4i\nuw4cjXRo+ULIkoJz7gQwGPgSWAN86JxbZWaPm1l3v9hgM1tlZkuBB4DbQhWPiOQvjWuU5tMh7fhT\nl/OYsS6FS/87k3fnf8+pU2qvkh2W1wa8SEhIcImJiZEOQ0Rykc17fuYvE1Ywb9OPtIgvw7+ubcS5\nFUtEOqxcxcwWOecSsiqnO5pFJM+rVb4479/dkqd7NWL9rkN0GzGLEVM36Ka3s6CkICL5gplxQ0IN\npj7QkSsuqMxzU9dz5chZar56hpQURCRfqVCyCM/7zVcPq/nqGVNSEJF86XTz1X5t43nPb746efl2\n8tp11HBTUhCRfKt4kRgevbohEwe1pWzxIgx+fwl9X1nAup0HIx1arqWkICL5XpMapZk8pB1P9LyA\nNTsP0G3kLB6btIr9R1SllJaSgogUCNFRxi2tajL9D53oc1EN3p63hc7PzGDcd1t1b0MAJQURKVDK\nFC/Mkz0vZNLgdtQuX5yHJqyg5wtzWLJ1b6RDyxWUFESkQLqgWhzjB7Zm+I1N2Ln/KNe8MJcHxy8j\n5eCxSIcWUUoKIlJgmRk9m1Zj2oOdGNCxNp8s/YGLn5nBq7M28cvJgnnjm5KCiBR4JYrE8HDX85ky\ntAPNapbhyc/W0HXELGZv2BPp0MJOSUFExFenQgne7NeCV29N4PiJU9z82gIGvrOIbT8djnRoYRMT\n6QBERHITM+PSBpVoV7c8r87axOjpG5m+bjf3dKrDgA51KFo4OtIhhpTOFERE0hFbKJrBF9flmz90\n5LIGlRg+dQOdnpnO2O+2ciIfX29QUhARyUTV0kUZ1bcZ4we2pnqZYjw8YQVXDP+WL1ftzJddZigp\niIgEoUV8WT4a2JqXb2kOwIB3FnHdi3P5bnP+6oVVSUFEJEhmxhUNK/Pl0A78+9oL+WHfEW54eR53\nvbWQ9bvyR39KGnlNROQsHTk5H2wSAAAOpUlEQVR+kjfmbubFGRv5+dgJrmtWnWGX1aNq6aKRDu13\ngh15TUlBRCSb9v58nBdmJPHW3O/BoF+beO7pVIfSxQpHOrRUSgoiImGWvPcwz329gQlLkilZJIZB\nnc/l9jbxxBaKfDNWJQURkQhZu/MAT09Zx7S1u6kSF8uwS+txXfPqREdZxGIKNimE9EKzmXUxs3Vm\nlmRmD6Wz/gEzW21my83sGzOrGcp4RETCoX7lUrx+ewvG9W9FpVKx/Ol/y+ky/Fs+W74j13fTHbKk\nYGbRwGigK9AA6GNmDdIUWwIkOOcaAR8BT4cqHhGRcGtVuxwTB7XhpZubcco57n1/MVcM/5ZPlv7A\nyVyaHEJ5pnARkOSc2+ScOw6MA3oEFnDOTXfOne5UZD5QPYTxiIiEnZnR5YIqfDWsI8/3aUqUGfeP\nW8qlz85kfOK2XNcbayiTQjVgW8B8sr8sI3cCX6S3wsz6m1mimSWmpKTkYIgiIuERHWVc3bgqX9zf\nnpdubkaxwtH88aPldH5mBu8v2MqxEycjHSIQ2qSQ3hWVdM+XzOxm
IAH4T3rrnXNjnHMJzrmEChUq\n5GCIIiLhFRXlnTlMHtKO125LoFyJIvxl4go6/WcGb83dwtFfIpscQpkUkoEaAfPVge1pC5nZpcBf\nge7OuYI95JGIFBhmxiXnV+LjQW14+46LqF6mKI9OWkWHp6fz6qxNHDkemeQQsiapZhYDrAcuAX4A\nFgJ9nXOrAso0xbvA3MU5tyGY7apJqojkR8455m/6iZHfbGDeph8pV7wwd7WvzS2ta1KiSPZHOcgV\n9ymYWTdgOBANvO6c+4eZPQ4kOucmmdlU4EJgh/+Urc657pltU0lBRPK7xC0/MXJaEt+uT6F0sULc\n0bYWt7WJJ65oobPeZq5ICqGgpCAiBcXSbfsYNW0DU9fspmSRGJ685gJ6NMmsvU7Ggk0KGnlNRCSX\nalKjNK/e1oJV2/czaloS8eWKh3yfSgoiIrlcw6pxvHhz87DsS+MpiIhIKiUFERFJpaQgIiKplBRE\nRCSVkoKIiKRSUhARkVRKCiIikkpJQUREUuW5bi7MLAX4/iyfXh7Yk4Ph5DTFlz2KL/tye4yK7+zV\ndM5lOfZAnksK2WFmicH0/REpii97FF/25fYYFV/oqfpIRERSKSmIiEiqgpYUxkQ6gCwovuxRfNmX\n22NUfCFWoK4piIhI5gramYKIiGQi3yUFM7vezFaZ2SkzS0iz7mEzSzKzdWZ2RQbPr2VmC8xsg5l9\nYGaFQxjrB2a21H9sMbOlGZTbYmYr/HJhG3bOzB4zsx8CYuyWQbku/jFNMrOHwhjff8xsrZktN7OJ\nZlY6g3JhPX5ZHQ8zK+K/90n+Zy0+1DEF7LuGmU03szX+/8n96ZTpZGb7A973R8IVn7//TN8v84z0\nj99yM2sWxtjOCzguS83sgJkNTVMmoscv25xz+eoBnA+cB8wAEgKWNwCWAUWAWsBGIDqd538I9Pan\nXwLuCVPc/wUeyWDdFqB8BI7lY8CDWZSJ9o9lbaCwf4wbhCm+y4EYf/op4KlIH79gjgcwCHjJn+4N\nfBDG97QK0MyfLgmsTye+TsDkcH/egn2/gG7AF4ABrYAFEYozGtiJ1/4/1xy/7D7y3ZmCc26Nc25d\nOqt6AOOcc8ecc5uBJOCiwAJmZsDFwEf+oreAnqGMN2C/NwBjQ72vELgISHLObXLOHQfG4R3rkHPO\nfeWcO+HPzgeqh2O/WQjmePTA+2yB91m7xP8MhJxzbodzbrE/fRBYA5zdoL+R0wN423nmA6XNrEoE\n4rgE2OicO9ubaXOlfJcUMlEN2BYwn8zv/xnKAfsCvmjSKxMK7YFdzrkNGax3wFdmtsjM+ochnkCD\n/VP0182sTDrrgzmu4XAH3q/H9ITz+AVzPFLL+J+1/XifvbDyq62aAgvSWd3azJaZ2Rdm1jCsgWX9\nfuWWz1xvMv4hF8njly15coxmM5sKVE5n1V+dc59k9LR0lqVtehVMmTMSZKx9yPwsoa1zbruZVQS+\nNrO1zrlvsxNXMPEBLwJP4B2DJ/CquO5Iu4l0nptjTdqCOX5m9lfgBPBeBpsJ2fFLR0Q+Z2fKzEoA\n/wOGOucOpFm9GK9K5JB/HeljoG4Yw8vq/coNx68w0B14OJ3VkT5+2ZInk4Jz7tKzeFoyUCNgvjqw\nPU2ZPXinojH+L7j0ypyRrGI1sxjgWiDDUbmdc9v9v7vNbCJeFUWOfKkFeyzN7BVgcjqrgjmuZy2I\n43cbcBVwifMrdNPZRsiOXzqCOR6nyyT7738c8FOI4vkdMyuElxDec85NSLs+MEk45z43sxfMrLxz\nLix9+gTxfoX0MxekrsBi59yutCsiffyyqyBVH00CevstP2rhZe7vAgv4XyrTgV7+otuAjM48csql\nwFrnXHJ6K82suJmVPD2Nd3F1ZYhjOr3vwHraazLY70KgrnmttgrjnVJPClN8XYA/A92dc4czKBPu\n4xfM8ZiE99kC77M2LaOEltP8
axevAWucc89mUKby6WscZnYR3vfEj2GKL5j3axJwq98KqRWw3zm3\nIxzxBcjw7D6Sxy9HRPpKd04/8L68koFjwC7gy4B1f8VrGbIO6Bqw/HOgqj9dGy9ZJAHjgSIhjvdN\nYGCaZVWBzwPiWeY/VuFVm4TrWL4DrACW4/0jVkkbnz/fDa8Vy8Ywx5eEV7e81H+8lDa+SBy/9I4H\n8Dhe8gKI9T9bSf5nrXYYj1k7vKqW5QHHrRsw8PTnEBjsH6tleBfw24QxvnTfrzTxGTDaP74rCGhl\nGKYYi+F9yccFLMsVxy8nHrqjWUREUhWk6iMREcmCkoKIiKRSUhARkVRKCiIikkpJQUREUikpiIhI\nKiWFAsjM/up3m7zc79q3ZQ5t91AGy0/6+1nl9wfzgJlF+esSzGykP13EzKb6ZW80s/b+c5aaWdGc\niDEUzGy4mXUwr/vupX6XzoFdJ7cJUxz3m9lGM3OWQTfimTz3kYB4TwZM3xuqeNPs/2HzuvNeZmZf\nmVk1f3k1MwvLzZDi0X0KBYyZtQaeBTo5546ZWXmgsPO7Fsjmtg8550pkttzvz+Z9YI5z7tE05Vrh\ndX/d0Z9/Ca9b5DeC3L/hfaZPZfOlBM3MyuLdKNcqYFknvC7HrwpXHP5+m+J1lzEHuMA5t+8sthED\n7HHOnVFSyS4zuxiY65w7ambDgCbOudv8dWOBZ5xzi8IZU0GlM4WCpwreP/0xAOfcntMJwcyam9lM\n83qn/PJ0NxdmVsfMpvjLZ5lZfX95LTObZ2YLzeyJYHbunNsN9MfrfdXMG5Bksp8s3gWa+L9QB+B1\nJ/6Imb3n7++P/r6Wm9nf/WXx/i/MF/A6IqthZpf7cS02s/Hmdf52evCWv/vLVwS8jhJm9oa/bLmZ\nXecvT3c7afQCpmT1us2sRcCx/cLMKvnLB/qvaZm/j6L+8nfNbLR5A+Js9M9E3jJvUKHXMji2S1wI\nunE2r9uGj/04F5jXdQNm1tY/PkvMbLaZ1Ql4TR+Z2WdmttnM7jazhwLKxaUT+zTn3FF/Nm036B8D\nN+X065IMRPqWaj3C+wBK4HVtsB54AejoLy8EzAUq+PM3Aq/7098Adf3plnh99YDfB40/fS9wKIN9\n/m45sBeoRMCAJKQZnASvC5Be/vTleIOiG96PmclAByAeOAW08suVx+s8rbg//2f8wYvwBm8Z4k8P\nAl71p58Chgfst0xm20nzOt4Crk6zLO3rKOIf2/L+/E3AGH+6XEC5f+MP6oSXIN/1p6/D6167gf/a\nl+KdCWT0HicDpc/y8xGD13184LL/AS386drAcn86Dn+gKrxOCd/zpwcCq/G6g6gKHARu99e9SJpu\nXdKJ4VUCBncC6gALI/2/U1AeebKXVDl7zuvOtzneGA6dgQ/MGzIyEbgAr6ti8EaV2uH/Om4DjLdf\nx4Ep4v9ti/eFBV4/SU+dQShnOqjM5f5jiT9fAq9Tw63A984bbAW8kbgaAHP8eAsD8wK2c7pX0EV4\nvdOC1ylh79MFnHN7zeyqLLZzWhUgJYvYzwcaAlMDju3pDhAbmdnjQGm8kdACe6L91P+7AtjunFsN\nYGar8ZJhWDpGxBtMpk7A+1/OvM7+ygLvmFltvPfzl4DnfOO8TgoPm9kRfvta4jPakZndCdQD7glY\nvBsvuUgYKCkUQM65k3jDlc4wsxV4PXYuAlY551oHljWzUni/HJtktLkz3b//JXIS75/9/GCfBvzL\nOfdymm3FAz+nKfe1c65PBts55v89ya+ffyP9MQ8y285pR/A6uMuM4f26bp/OurfxOmdcaWZ34SW1\ntLGeCpg+PX9W/7tm9jbQCNjqnOseRPnTmSDB/Tr41Ol1/8I7IxrjV8V9nE7saePPMHYzuxIYBnRw\nzgUmmFi84yxhoGsKBYx5A48HDvjRBPger+fYCuZdiMbMCplZQ+f1Db/ZzK73l5uZNfafO4dff2
EH\nVedrZhXwxr4e5fy6gSB9CdwRcH2gmn8dIq35QFszO9cvV8zM6mWx7a/werY8HWOZM9jOGuDcLLa/\nGqgWUBdf2H4djas4sNO8MQ76ZrGdbHPO3eqcaxJMQvDLO2AaAb/czez0D4Q44Ad/+vbsxOUfmxHA\nVc65tGNL1CN8Z0UFnpJCwVMCeMvMVpvZcrwqksecN55wL+ApM1uGV299uinlTcCd/vJV/Drm8P3A\nvWa2EO8LIiNFzW+SCkzF+xL++5kE7Zz7Cq/V0jz/7OYjvOqWtOVS8L6gxvqvbz5QP4vNPwmUMbOV\n/mvsfAbb+QzvGkJmsR/DO7bP+ttfgndtBuARvO6zv8ZLHmfNvKa+yXgj1a0ys5ezek6Q7gE6+xfh\nV/Pr6Hv/Aoab2Ry8M6/seBbvs/mx/1n5KGBdZ7zjLGGgJqki2WRms/F+4Z5xE1DJnF99NRvo4pw7\nGOl4CgIlBZFsMu/mvyPOueWRjiW/Ma9ZdHPnXHpDwUoIKCmIiEgqXVMQEZFUSgoiIpJKSUFERFIp\nKYiISColBRERSfX/cdjlE1FHkI4AAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x1a18613908>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# Visualization: predicted P(Team 1 wins) as a function of seed difference\n", | |
"X = np.arange(-10, 10).reshape(-1, 1)\n", | |
"preds = clf.predict_proba(X)[:, 1]\n", | |
"\n", | |
"plt.plot(X, preds);\n", | |
"plt.xlabel('Seed Difference (Team 1 - Team 2)');\n", | |
"plt.ylabel('P(Team 1)');" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Average Ranking Based Logistic Regression Model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"data_dir = './March Madness 2018/'\n", | |
"df_massey = pd.read_csv(data_dir + 'MasseyOrdinals.csv')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>RankingDayNum</th>\n", | |
" <th>SystemName</th>\n", | |
" <th>TeamID</th>\n", | |
" <th>OrdinalRank</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>35</td>\n", | |
" <td>SEL</td>\n", | |
" <td>1102</td>\n", | |
" <td>159</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>35</td>\n", | |
" <td>SEL</td>\n", | |
" <td>1103</td>\n", | |
" <td>229</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>35</td>\n", | |
" <td>SEL</td>\n", | |
" <td>1104</td>\n", | |
" <td>12</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>35</td>\n", | |
" <td>SEL</td>\n", | |
" <td>1105</td>\n", | |
" <td>314</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>35</td>\n", | |
" <td>SEL</td>\n", | |
" <td>1106</td>\n", | |
" <td>260</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season RankingDayNum SystemName TeamID OrdinalRank\n", | |
"0 2003 35 SEL 1102 159\n", | |
"1 2003 35 SEL 1103 229\n", | |
"2 2003 35 SEL 1104 12\n", | |
"3 2003 35 SEL 1105 314\n", | |
"4 2003 35 SEL 1106 260" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_massey.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
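"# Build a composite ranking: average each team's ordinal rank across all\n", | |
"# ranking systems on the final ranking day (final_day)\n", | |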
"final_day = 133\n", | |
"df_final_rankings = df_massey.loc[df_massey['RankingDayNum'] == final_day]\n", | |
"df_final_rankings = df_final_rankings.groupby(['Season', 'TeamID'])['OrdinalRank'].mean()\n", | |
"df_final_rankings = df_final_rankings.reset_index()\n", | |
"df_final_rankings.rename(columns={'OrdinalRank':'Avg. Rank'}, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>TeamID</th>\n", | |
" <th>Avg. Rank</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>1102</td>\n", | |
" <td>156.03125</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>1103</td>\n", | |
" <td>168.00000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>1104</td>\n", | |
" <td>38.03125</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>1105</td>\n", | |
" <td>308.96875</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>1106</td>\n", | |
" <td>262.68750</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season TeamID Avg. Rank\n", | |
"0 2003 1102 156.03125\n", | |
"1 2003 1103 168.00000\n", | |
"2 2003 1104 38.03125\n", | |
"3 2003 1105 308.96875\n", | |
"4 2003 1106 262.68750" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_final_rankings.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>WScore</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>LScore</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1116</td>\n", | |
" <td>63</td>\n", | |
" <td>1234</td>\n", | |
" <td>54</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1120</td>\n", | |
" <td>59</td>\n", | |
" <td>1345</td>\n", | |
" <td>58</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1207</td>\n", | |
" <td>68</td>\n", | |
" <td>1250</td>\n", | |
" <td>43</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1229</td>\n", | |
" <td>58</td>\n", | |
" <td>1425</td>\n", | |
" <td>55</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1242</td>\n", | |
" <td>49</td>\n", | |
" <td>1325</td>\n", | |
" <td>38</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID WScore LTeamID LScore\n", | |
"0 1985 136 1116 63 1234 54\n", | |
"1 1985 136 1120 59 1345 58\n", | |
"2 1985 136 1207 68 1250 43\n", | |
"3 1985 136 1229 58 1425 55\n", | |
"4 1985 136 1242 49 1325 38" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Load tournament results and drop the columns that are not used below\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['WLoc', 'NumOT'], inplace=True, axis=1)\n", | |
"df_tour.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>WAvgRank</th>\n", | |
" <th>LAvgRank</th>\n", | |
" <th>RankDiff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>240.343750</td>\n", | |
" <td>239.281250</td>\n", | |
" <td>1.062500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>2.676471</td>\n", | |
" <td>153.125000</td>\n", | |
" <td>-150.448529</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>36.000000</td>\n", | |
" <td>21.705882</td>\n", | |
" <td>14.294118</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>45.687500</td>\n", | |
" <td>20.735294</td>\n", | |
" <td>24.952206</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>36.406250</td>\n", | |
" <td>50.312500</td>\n", | |
" <td>-13.906250</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID WAvgRank LAvgRank RankDiff\n", | |
"0 2003 134 1421 1411 240.343750 239.281250 1.062500\n", | |
"1 2003 136 1112 1436 2.676471 153.125000 -150.448529\n", | |
"2 2003 136 1113 1272 36.000000 21.705882 14.294118\n", | |
"3 2003 136 1141 1166 45.687500 20.735294 24.952206\n", | |
"4 2003 136 1143 1301 36.406250 50.312500 -13.906250" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
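"# Attach winner and loser average ranks to each tournament game, then\n", | |
"# compute score and rank differences (winner minus loser)\n", | |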
"df_win_ranks = df_final_rankings.rename(columns={'TeamID':'WTeamID', 'Avg. Rank':'WAvgRank'})\n", | |
"df_loss_ranks = df_final_rankings.rename(columns={'TeamID':'LTeamID', 'Avg. Rank':'LAvgRank'})\n", | |
"df_dummy = pd.merge(left=df_tour, right=df_win_ranks, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_loss_ranks, on=['Season', 'LTeamID'])\n", | |
"df_concat['ScoreDiff'] = df_concat['WScore'] - df_concat['LScore']\n", | |
"df_concat['RankDiff'] = df_concat['WAvgRank'] - df_concat['LAvgRank']\n", | |
"df_total = df_concat[['Season', 'DayNum', 'WTeamID','LTeamID', 'WAvgRank', 'LAvgRank', 'RankDiff']]\n", | |
"df_total.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x1102cf9b0>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# Is ranking difference correlated with score difference?\n", | |
"plt.scatter(df_concat['RankDiff'], df_concat['ScoreDiff']);" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Build labeled examples: each game from the winner's side (Result = 1)\n", | |
"# and mirrored from the loser's side (Result = 0)\n", | |
"df_wins = pd.DataFrame()\n", | |
"df_wins['RankDiff'] = df_total['RankDiff']\n", | |
"df_wins['Result'] = 1\n", | |
"\n", | |
"df_losses = pd.DataFrame()\n", | |
"df_losses['RankDiff'] = -df_total['RankDiff']\n", | |
"df_losses['Result'] = 0" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>RankDiff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1.062500</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>-150.448529</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>14.294118</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>24.952206</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>-13.906250</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" RankDiff Result\n", | |
"0 1.062500 1\n", | |
"1 -150.448529 1\n", | |
"2 14.294118 1\n", | |
"3 24.952206 1\n", | |
"4 -13.906250 1" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_predictions = pd.concat((df_wins, df_losses))\n", | |
"df_predictions.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"X_train = df_predictions['RankDiff'].values.reshape(-1,1)\n", | |
"Y_train = df_predictions['Result'].values\n", | |
"X_train, Y_train = shuffle(X_train, Y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best log_loss: -0.5468, with best C: 0.0016681005372000592\n" | |
] | |
} | |
], | |
"source": [ | |
"# Fit and test model\n", | |
"logreg2 = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf.fit(X_train, Y_train)\n", | |
"print('Best log_loss: {:.4}, with best C: {}'.format(clf.best_score_, clf.best_params_['C']))" | |
] | |
}, | |
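The "Best log_loss" values printed by these cells are negative because scikit-learn's `neg_log_loss` scorer reports the negated log loss, so that a higher score is always better. The sketch below computes plain log loss by hand to show the sign convention; the function name `log_loss_manual` and the probabilities are illustrative assumptions, not part of the notebook.

```python
import math

def log_loss_manual(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
coin_flip = log_loss_manual(y_true, [0.5, 0.5, 0.5, 0.5])   # uninformed baseline
confident = log_loss_manual(y_true, [0.9, 0.1, 0.8, 0.2])   # hypothetical better model

# Negating the loss turns "lower is better" into "higher is better"
print(-coin_flip, -confident)
```

An always-50% prediction scores the natural log of 2, about 0.693, so the values around -0.55 in the results table are a modest improvement over that baseline.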
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Type</th>\n", | |
" <th>Log Loss</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Seed Based Logistic Regression</td>\n", | |
" <td>-0.553150</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Avg. Ranking Based Logistic Regression</td>\n", | |
" <td>-0.546793</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Type Log Loss\n", | |
"0 Seed Based Logistic Regression -0.553150\n", | |
"0 Avg. Ranking Based Logistic Regression -0.546793" | |
] | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Store model results\n", | |
"df_results = df_results.append(pd.DataFrame({'Type': ['Avg. Ranking Based Logistic Regression'], 'Log Loss': [clf.best_score_]}, columns=['Type', 'Log Loss']))\n", | |
"df_results.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### FiveThirtyEight Elo Logistic Regression Implementation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Homecourt Bonus\n", | |
"HOME_ADVANTAGE = 100 \n", | |
"# Learning rate\n", | |
"K = 22" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>WScore</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>LScore</th>\n", | |
" <th>WLoc</th>\n", | |
" <th>NumOT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>20</td>\n", | |
" <td>1228</td>\n", | |
" <td>81</td>\n", | |
" <td>1328</td>\n", | |
" <td>64</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1106</td>\n", | |
" <td>77</td>\n", | |
" <td>1354</td>\n", | |
" <td>70</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1112</td>\n", | |
" <td>63</td>\n", | |
" <td>1223</td>\n", | |
" <td>56</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1165</td>\n", | |
" <td>70</td>\n", | |
" <td>1432</td>\n", | |
" <td>54</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1192</td>\n", | |
" <td>86</td>\n", | |
" <td>1447</td>\n", | |
" <td>74</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT\n", | |
"0 1985 20 1228 81 1328 64 N 0\n", | |
"1 1985 25 1106 77 1354 70 H 0\n", | |
"2 1985 25 1112 63 1223 56 H 0\n", | |
"3 1985 25 1165 70 1432 54 H 0\n", | |
"4 1985 25 1192 86 1447 74 H 0" | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Load regular season data\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"rs = pd.read_csv(data_dir + 'RegularSeasonCompactResults.csv')\n", | |
"rs.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"364" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Teams\n", | |
"team_ids = set(rs.WTeamID).union(set(rs.LTeamID))\n", | |
"len(team_ids)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Score lookup dict\n", | |
"elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# New columns to help us iteratively update elos\n", | |
"rs['margin'] = rs.WScore - rs.LScore\n", | |
"rs['w_elo'] = None\n", | |
"rs['l_elo'] = None" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Elo calculation\n", | |
"def elo_pred(elo1, elo2):\n", | |
" return(1. / (10. ** (-(elo1 - elo2) / 400.) + 1.))\n", | |
"\n", | |
"def expected_margin(elo_diff):\n", | |
" return((7.5 + 0.006 * elo_diff))\n", | |
"\n", | |
"def elo_update(w_elo, l_elo, margin):\n", | |
" elo_diff = w_elo - l_elo\n", | |
" pred = elo_pred(w_elo, l_elo)\n", | |
" mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)\n", | |
" update = K * mult * (1 - pred)\n", | |
" return(pred, update)" | |
] | |
}, | |
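To get a feel for the update rule, the functions from the cell above can be evaluated on a few hand-picked matchups. The ratings and the 10-point margin below are illustrative assumptions; the formulas and `K = 22` are copied from the notebook so that the block runs on its own.

```python
K = 22  # same K-factor as the notebook

def elo_pred(elo1, elo2):
    # Expected win probability of the team rated elo1
    return 1.0 / (10.0 ** (-(elo1 - elo2) / 400.0) + 1.0)

def expected_margin(elo_diff):
    return 7.5 + 0.006 * elo_diff

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.0) ** 0.8) / expected_margin(elo_diff)
    return pred, K * mult * (1 - pred)

# Same 10-point win in three hypothetical matchups
even_pred, even_update = elo_update(1500, 1500, 10)     # evenly matched teams
fav_pred, fav_update = elo_update(1700, 1500, 10)       # favorite wins
upset_pred, upset_update = elo_update(1500, 1700, 10)   # underdog wins

for name, pred, update in [("even", even_pred, even_update),
                           ("favorite", fav_pred, fav_update),
                           ("upset", upset_pred, upset_update)]:
    print(f"{name}: win prob {pred:.3f}, points transferred {update:.2f}")
```

The same margin moves the ratings least when the favorite wins and most when the underdog wins, which is the behavior the margin multiplier and the `(1 - pred)` term are there to produce.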
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Check order\n", | |
"assert np.all(rs.index.values == np.array(range(rs.shape[0]))), \"Index is out of order.\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Iterate through all games\n", | |
"preds = []\n", | |
"for i in range(rs.shape[0]):\n", | |
" \n", | |
" # Get key data from current row\n", | |
" w = rs.at[i, 'WTeamID']\n", | |
" l = rs.at[i, 'LTeamID']\n", | |
" margin = rs.at[i, 'margin']\n", | |
" wloc = rs.at[i, 'WLoc']\n", | |
" \n", | |
" # Does either team get a home-court advantage?\n", | |
" w_ad, l_ad, = 0., 0.\n", | |
" if wloc == \"H\":\n", | |
" w_ad += HOME_ADVANTAGE\n", | |
" elif wloc == \"A\":\n", | |
" l_ad += HOME_ADVANTAGE\n", | |
" \n", | |
" # Get elo updates as a result of the game\n", | |
" pred, update = elo_update(elo_dict[w] + w_ad,\n", | |
" elo_dict[l] + l_ad, \n", | |
" margin)\n", | |
" elo_dict[w] += update\n", | |
" elo_dict[l] -= update\n", | |
" preds.append(pred)\n", | |
" \n", | |
" # Stores new elos in the games dataframe\n", | |
" rs.loc[i, 'w_elo'] = elo_dict[w]\n", | |
" rs.loc[i, 'l_elo'] = elo_dict[l]" | |
] | |
}, | |
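The game loop above can be traced on a toy schedule. The three games and team labels below are hypothetical; the constants and the update formula mirror the notebook. The home bonus only affects the prediction and the size of the update, while the stored ratings change by equal and opposite amounts, so the total number of Elo points stays constant.

```python
HOME_ADVANTAGE = 100  # Elo points credited to the home team, as in the notebook
K = 22                # K-factor, as in the notebook

def elo_update(w_elo, l_elo, margin):
    # Same formula as the notebook, inlined so the sketch is self-contained
    pred = 1.0 / (10.0 ** (-(w_elo - l_elo) / 400.0) + 1.0)
    mult = ((margin + 3.0) ** 0.8) / (7.5 + 0.006 * (w_elo - l_elo))
    return pred, K * mult * (1 - pred)

# Hypothetical games: (winner, loser, margin, winner location H/A/N)
games = [("A", "B", 5, "H"), ("B", "C", 12, "A"), ("C", "A", 3, "N")]
elos = {"A": 1500.0, "B": 1500.0, "C": 1500.0}

for winner, loser, margin, wloc in games:
    w_bonus = HOME_ADVANTAGE if wloc == "H" else 0.0   # winner played at home
    l_bonus = HOME_ADVANTAGE if wloc == "A" else 0.0   # winner played away
    _, update = elo_update(elos[winner] + w_bonus, elos[loser] + l_bonus, margin)
    elos[winner] += update
    elos[loser] -= update

print(elos)
print(sum(elos.values()))  # total rating points across all teams
```

In this toy schedule team B ends with the highest rating, since its road win over C by 12 points is rewarded more than the narrower home and neutral-site wins.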
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def final_elo_per_season(df, team_id):\n", | |
" d = df.copy()\n", | |
" d = d.loc[(d.WTeamID == team_id) | (d.LTeamID == team_id), :]\n", | |
" d.sort_values(['Season', 'DayNum'], inplace=True)\n", | |
" d.drop_duplicates(['Season'], keep='last', inplace=True)\n", | |
" w_mask = d.WTeamID == team_id\n", | |
" l_mask = d.LTeamID == team_id\n", | |
" d['season_elo'] = None\n", | |
" d.loc[w_mask, 'season_elo'] = d.loc[w_mask, 'w_elo']\n", | |
" d.loc[l_mask, 'season_elo'] = d.loc[l_mask, 'l_elo']\n", | |
" out = pd.DataFrame({\n", | |
" 'team_id': team_id,\n", | |
" 'season': d.Season,\n", | |
" 'season_elo': d.season_elo\n", | |
" })\n", | |
" return(out)" | |
] | |
}, | |
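The idea behind `final_elo_per_season` is to take, for each season, the rating recorded after the team's last regular-season game. A pure-Python sketch of the same selection is shown here; the game tuples are made-up values and the helper name `final_elo_by_season` is hypothetical.

```python
# Hypothetical rows: (season, day_num, w_team, l_team, w_elo, l_elo)
games = [
    (2016, 10, 1101, 1102, 1510.0, 1490.0),
    (2016, 50, 1102, 1101, 1502.0, 1498.0),
    (2017, 12, 1101, 1103, 1512.0, 1488.0),
]

def final_elo_by_season(games, team_id):
    """Map each season to the team's Elo after its last game that season."""
    final = {}
    # Process games chronologically so later games overwrite earlier ones
    for season, day, w, l, w_elo, l_elo in sorted(games, key=lambda g: (g[0], g[1])):
        if w == team_id:
            final[season] = w_elo
        elif l == team_id:
            final[season] = l_elo
    return final

result = final_elo_by_season(games, 1101)
print(result)
```

Team 1101 lost its last 2016 game, so its 2016 value comes from the loser column of that row, which is the case the `w_mask` and `l_mask` logic handles in the notebook.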
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"df_list = [final_elo_per_season(rs, i) for i in team_ids]\n", | |
"season_elos = pd.concat(df_list)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>season</th>\n", | |
" <th>season_elo</th>\n", | |
" <th>team_id</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>134286</th>\n", | |
" <td>2014</td>\n", | |
" <td>1317.05</td>\n", | |
" <td>1101</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>139681</th>\n", | |
" <td>2015</td>\n", | |
" <td>1201.11</td>\n", | |
" <td>1101</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>145038</th>\n", | |
" <td>2016</td>\n", | |
" <td>1213.74</td>\n", | |
" <td>1101</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>150369</th>\n", | |
" <td>2017</td>\n", | |
" <td>1233.86</td>\n", | |
" <td>1101</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3606</th>\n", | |
" <td>1985</td>\n", | |
" <td>1404.46</td>\n", | |
" <td>1102</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" season season_elo team_id\n", | |
"134286 2014 1317.05 1101\n", | |
"139681 2015 1201.11 1101\n", | |
"145038 2016 1213.74 1101\n", | |
"150369 2017 1233.86 1101\n", | |
"3606 1985 1404.46 1102" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"season_elos.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>W_Elo</th>\n", | |
" <th>L_Elo</th>\n", | |
" <th>Elo_Diff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>1116</td>\n", | |
" <td>1234</td>\n", | |
" <td>1591.58</td>\n", | |
" <td>1611.14</td>\n", | |
" <td>-19.5577</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>1120</td>\n", | |
" <td>1345</td>\n", | |
" <td>1571.38</td>\n", | |
" <td>1582.63</td>\n", | |
" <td>-11.2464</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>1207</td>\n", | |
" <td>1250</td>\n", | |
" <td>1748.49</td>\n", | |
" <td>1430.35</td>\n", | |
" <td>318.145</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>1229</td>\n", | |
" <td>1425</td>\n", | |
" <td>1582.04</td>\n", | |
" <td>1578.1</td>\n", | |
" <td>3.94023</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>1242</td>\n", | |
" <td>1325</td>\n", | |
" <td>1615.96</td>\n", | |
" <td>1600.98</td>\n", | |
" <td>14.9841</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season WTeamID LTeamID W_Elo L_Elo Elo_Diff\n", | |
"0 1985 1116 1234 1591.58 1611.14 -19.5577\n", | |
"1 1985 1120 1345 1571.38 1582.63 -11.2464\n", | |
"2 1985 1207 1250 1748.49 1430.35 318.145\n", | |
"3 1985 1229 1425 1582.04 1578.1 3.94023\n", | |
"4 1985 1242 1325 1615.96 1600.98 14.9841" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Logistic Regression\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['DayNum','WLoc', 'NumOT', 'WScore', 'LScore'], inplace=True, axis=1)\n", | |
"\n", | |
"df_win_elos = season_elos.rename(columns={'team_id':'WTeamID', 'season':'Season', 'season_elo':'W_Elo'}) #\n", | |
"df_loss_elos = season_elos.rename(columns={'team_id':'LTeamID', 'season':'Season', 'season_elo':'L_Elo'}) #\n", | |
"df_dummy = pd.merge(left=df_tour, right=df_win_elos, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_loss_elos, on=['Season', 'LTeamID'])\n", | |
"df_concat['Elo_Diff'] = df_concat['W_Elo'] - df_concat['L_Elo']\n", | |
"df_concat.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Prediction dataframe\n", | |
"df_wins = pd.DataFrame()\n", | |
"df_wins['Elo_Diff'] = df_concat['Elo_Diff']\n", | |
"df_wins['Result'] = 1\n", | |
"\n", | |
"df_losses = pd.DataFrame()\n", | |
"df_losses['Elo_Diff'] = -df_concat['Elo_Diff']\n", | |
"df_losses['Result'] = 0" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Elo_Diff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>-19.5577</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>-11.2464</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>318.145</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3.94023</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>14.9841</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Elo_Diff Result\n", | |
"0 -19.5577 1\n", | |
"1 -11.2464 1\n", | |
"2 318.145 1\n", | |
"3 3.94023 1\n", | |
"4 14.9841 1" | |
] | |
}, | |
"execution_count": 40, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_predictions = pd.concat((df_wins, df_losses))\n", | |
"df_predictions.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"X_train = df_predictions['Elo_Diff'].values.reshape(-1,1)\n", | |
"Y_train = df_predictions['Result'].values\n", | |
"X_train, Y_train = shuffle(X_train, Y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best log_loss: -0.5428, with best C: 0.0001291549665014884\n" | |
] | |
} | |
], | |
"source": [ | |
"# Fit and test model\n", | |
"logreg2 = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf.fit(X_train, Y_train)\n", | |
"print('Best log_loss: {:.4}, with best C: {}'.format(clf.best_score_, clf.best_params_['C']))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Type</th>\n", | |
" <th>Log Loss</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Seed Based Logistic Regression</td>\n", | |
" <td>-0.553150</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Avg. Ranking Based Logistic Regression</td>\n", | |
" <td>-0.546793</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>FiveThirtyEight Elo Logistic Regression</td>\n", | |
" <td>-0.542821</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Type Log Loss\n", | |
"0 Seed Based Logistic Regression -0.553150\n", | |
"0 Avg. Ranking Based Logistic Regression -0.546793\n", | |
"0 FiveThirtyEight Elo Logistic Regression -0.542821" | |
] | |
}, | |
"execution_count": 43, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Store model results\n", | |
"df_results = df_results.append(pd.DataFrame({'Type': ['FiveThirtyEight Elo Logistic Regression'], 'Log Loss': [clf.best_score_]}, columns=['Type', 'Log Loss']))\n", | |
"df_results.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Select Ranking Systems" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>1116</td>\n", | |
" <td>1234</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>1120</td>\n", | |
" <td>1345</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>1207</td>\n", | |
" <td>1250</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>1229</td>\n", | |
" <td>1425</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>1242</td>\n", | |
" <td>1325</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season WTeamID LTeamID\n", | |
"0 1985 1116 1234\n", | |
"1 1985 1120 1345\n", | |
"2 1985 1207 1250\n", | |
"3 1985 1229 1425\n", | |
"4 1985 1242 1325" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['DayNum','WLoc', 'NumOT', 'WScore', 'LScore'], inplace=True, axis=1)\n", | |
"df_tour.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['SEL', 'AP', 'BIH', 'DUN', 'ENT', 'GRN', 'IMS', 'MAS', 'MKV', 'MOR', 'POM', 'RPI', 'SAG', 'SAU', 'SE', 'STR', 'USA', 'WLK', 'WOB', 'BOB', 'DWH', 'ERD', 'ECK', 'BRZ', 'ARG', 'RTH', 'WOL', 'HOL', 'COL', 'DOL', 'GRS', 'HER', 'TSR', 'WTE', 'BD', 'MGY', 'CNG', 'SIM', 'DES', 'JON', 'LYN', 'NOR', 'RM', 'REI', 'ACU', 'BCM', 'CMV', 'SAP', 'DC', 'KLK', 'WIL', 'ROH', 'RIS', 'REN', 'SCR', 'DOK', 'PIG', 'KPK', 'PKL', 'TRX', 'MB', 'JCI', 'PH', 'LYD', 'KRA', 'RTR', 'UCS', 'ISR', 'CPR', 'BKM', 'JEN', 'REW', 'STH', 'SPW', 'RSE', 'PGH', 'CPA', 'RTB', 'HKB', 'BPI', 'TW', 'NOL', 'DC2', 'DCI', 'OMY', 'LMC', 'RT', 'KEL', 'KMV', 'RTP', 'TMR', 'AUS', 'ROG', 'PTS', 'KOS', 'PEQ', 'ADE', 'BNM', 'CJB', 'BUR', 'HAT', 'MSX', 'BBT', '7OT', 'SFX', 'EBP', 'TBD', 'CRO', 'D1A', 'TPR', 'BLS', 'DII', 'KBM', 'TRP', 'LOG', 'SP', 'STF', 'WMR', 'PPR', 'STS', 'UPS', 'SPR', 'MvG', 'TRK', 'BWE', 'HAS', 'FSH', 'DAV', 'KPI', 'FAS', 'MCL', 'HRN', 'RSL', 'SMN', 'DDB', 'INP', 'JRT', 'ESR', 'FMG', 'PRR', 'SMS', 'HKS', 'MUZ', 'OCT', 'SGR', 'ZAM', 'JNG', 'CRW', 'PMC', 'YAG']\n" | |
] | |
} | |
], | |
"source": [ | |
"# Get list of all ranking systems\n", | |
"ranking_types = df_massey['SystemName'].unique().tolist()\n", | |
"ranking_types = [e for e in ranking_types if e not in ('MIC', 'GC', 'RAG', 'TOL', 'EBB', 'BP5', 'MPI', 'BOW', 'CTL')]\n", | |
"print(ranking_types)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Iterate through each ranking and check log loss\n", | |
"def logreg_type(mytype):\n", | |
" df_type = df_massey.loc[(df_massey['RankingDayNum'] == final_day) & (df_massey['SystemName'] == mytype)]\n", | |
" df_type = df_type.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
" df_type.rename(columns={'OrdinalRank':'Type Rank'}, inplace=True)\n", | |
"\n", | |
" df_win_ranks = df_type.rename(columns={'TeamID':'WTeamID', 'Type Rank':'WTypeRank'})\n", | |
" df_loss_ranks = df_type.rename(columns={'TeamID':'LTeamID', 'Type Rank':'LTypeRank'})\n", | |
" df_dummy = pd.merge(left=df_tour, right=df_win_ranks, how='left', on=['Season', 'WTeamID'])\n", | |
" df_concat = pd.merge(left=df_dummy, right=df_loss_ranks, on=['Season', 'LTeamID'])\n", | |
" df_concat['RankDiff'] = df_concat['WTypeRank'] - df_concat['LTypeRank']\n", | |
" df_total = df_concat[['Season', 'WTeamID','LTeamID', 'WTypeRank', 'LTypeRank', 'RankDiff']]\n", | |
" \n", | |
" if len(df_total) > 980:\n", | |
" df_wins = pd.DataFrame()\n", | |
" df_wins['RankDiff'] = df_total['RankDiff']\n", | |
" df_wins['Result'] = 1\n", | |
" df_losses = pd.DataFrame()\n", | |
" df_losses['RankDiff'] = -df_total['RankDiff']\n", | |
" df_losses['Result'] = 0\n", | |
"\n", | |
" df_predictions = pd.concat((df_wins, df_losses))\n", | |
"\n", | |
" X_train = df_predictions['RankDiff'].values.reshape(-1,1)\n", | |
" Y_train = df_predictions['Result'].values\n", | |
" X_train, Y_train = shuffle(X_train, Y_train)\n", | |
" if np.isnan(np.sum(X_train)) == False:\n", | |
"\n", | |
" logregtype = LogisticRegression()\n", | |
" params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
" clf = GridSearchCV(logregtype, params, scoring='neg_log_loss', refit=True)\n", | |
" clf.fit(X_train, Y_train)\n", | |
"\n", | |
" print('{} - Best log_loss: {:.4}, with best C: {}'.format(mytype, clf.best_score_, clf.best_params_['C']))\n", | |
" return(pd.DataFrame({'Type': [mytype], 'Log Loss': [clf.best_score_]}, columns=['Type', 'Log Loss']))\n", | |
" return(pd.DataFrame({'Type': [mytype], 'Log Loss': [999]}, columns=['Type', 'Log Loss']))\n", | |
" return(pd.DataFrame({'Type': [mytype], 'Log Loss': [999]}, columns=['Type', 'Log Loss']))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"MOR - Best log_loss: -0.5515, with best C: 0.0001291549665014884\n", | |
"POM - Best log_loss: -0.5514, with best C: 0.0001291549665014884\n", | |
"RPI - Best log_loss: -0.5583, with best C: 0.0001291549665014884\n", | |
"SAG - Best log_loss: -0.5491, with best C: 0.0001291549665014884\n", | |
"WLK - Best log_loss: -0.5523, with best C: 0.0001291549665014884\n", | |
"RTH - Best log_loss: -0.5557, with best C: 0.0001291549665014884\n", | |
"WOL - Best log_loss: -0.5573, with best C: 0.0016681005372000592\n", | |
"COL - Best log_loss: -0.5591, with best C: 0.0001291549665014884\n", | |
"DOL - Best log_loss: -0.5571, with best C: 0.0001291549665014884\n" | |
] | |
} | |
], | |
"source": [ | |
"df_type_scores = pd.DataFrame(columns=['Type', 'Log Loss'])\n", | |
"for mytype in ranking_types:\n", | |
" df_type_scores = df_type_scores.append(logreg_type(mytype))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Type</th>\n", | |
" <th>Log Loss</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>SAG</td>\n", | |
" <td>-0.549115</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>POM</td>\n", | |
" <td>-0.551438</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>MOR</td>\n", | |
" <td>-0.551542</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>WLK</td>\n", | |
" <td>-0.552273</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>RTH</td>\n", | |
" <td>-0.555652</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>DOL</td>\n", | |
" <td>-0.557051</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>WOL</td>\n", | |
" <td>-0.557305</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>RPI</td>\n", | |
" <td>-0.558276</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>COL</td>\n", | |
" <td>-0.559128</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Type Log Loss\n", | |
"0 SAG -0.549115\n", | |
"0 POM -0.551438\n", | |
"0 MOR -0.551542\n", | |
"0 WLK -0.552273\n", | |
"0 RTH -0.555652\n", | |
"0 DOL -0.557051\n", | |
"0 WOL -0.557305\n", | |
"0 RPI -0.558276\n", | |
"0 COL -0.559128" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_type_scores = df_type_scores.loc[df_type_scores['Log Loss'] != 999]\n", | |
"df_type_scores.sort_values(by='Log Loss', ascending=False, inplace=True)\n", | |
"df_type_scores" | |
] | |
}, | |
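The filter-then-sort step above matters because the sentinel value 999 would otherwise rank first when sorting in descending order. The sketch below repeats that step in plain Python using the scores printed by the notebook; the `XYZ` entry is a hypothetical unscored system added only to show the sentinel being dropped.

```python
SENTINEL = 999  # marks a system that could not be scored, as in logreg_type

# Negated log loss per ranking system, taken from the notebook's printed output
scores = {
    "MOR": -0.5515, "POM": -0.5514, "RPI": -0.5583,
    "SAG": -0.5491, "WLK": -0.5523, "RTH": -0.5557,
    "WOL": -0.5573, "COL": -0.5591, "DOL": -0.5571,
    "XYZ": SENTINEL,  # hypothetical system with too little coverage
}

# Drop sentinel rows first, then sort so the score closest to zero comes first
valid = {name: s for name, s in scores.items() if s != SENTINEL}
ranked = sorted(valid.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranked:
    print(name, score)
```

Sorting in descending order is correct here only because the scores are negated losses, so the value closest to zero corresponds to the lowest log loss.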
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Final Model Selection\n", | |
"Right now, I have a couple different metric options to test, tune, and consider for use in the upcoming tournament:\n", | |
"1. FiveThirtyEight Elo Ratings\n", | |
"2. Average Select Ranking Systems\n", | |
"3. Composite Model" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 1. FiveThirtyEight Elo Ratings Model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>W_Elo</th>\n", | |
" <th>L_Elo</th>\n", | |
" <th>Elo_Diff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1116</td>\n", | |
" <td>1234</td>\n", | |
" <td>1591.58</td>\n", | |
" <td>1611.14</td>\n", | |
" <td>-19.5577</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1120</td>\n", | |
" <td>1345</td>\n", | |
" <td>1571.38</td>\n", | |
" <td>1582.63</td>\n", | |
" <td>-11.2464</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1207</td>\n", | |
" <td>1250</td>\n", | |
" <td>1748.49</td>\n", | |
" <td>1430.35</td>\n", | |
" <td>318.145</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1229</td>\n", | |
" <td>1425</td>\n", | |
" <td>1582.04</td>\n", | |
" <td>1578.1</td>\n", | |
" <td>3.94023</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1242</td>\n", | |
" <td>1325</td>\n", | |
" <td>1615.96</td>\n", | |
" <td>1600.98</td>\n", | |
" <td>14.9841</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID W_Elo L_Elo Elo_Diff\n", | |
"0 1985 136 1116 1234 1591.58 1611.14 -19.5577\n", | |
"1 1985 136 1120 1345 1571.38 1582.63 -11.2464\n", | |
"2 1985 136 1207 1250 1748.49 1430.35 318.145\n", | |
"3 1985 136 1229 1425 1582.04 1578.1 3.94023\n", | |
"4 1985 136 1242 1325 1615.96 1600.98 14.9841" | |
] | |
}, | |
"execution_count": 49, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# How does Elo perform alone?\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['WLoc', 'NumOT', 'WScore', 'LScore'], inplace=True, axis=1)\n", | |
"\n", | |
"df_win_elos = season_elos.rename(columns={'team_id':'WTeamID', 'season_elo':'W_Elo', 'season':'Season'})\n", | |
"df_loss_elos = season_elos.rename(columns={'team_id':'LTeamID', 'season_elo':'L_Elo', 'season':'Season'}) \n", | |
"df_dummy = pd.merge(left=df_tour, right=df_win_elos, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_loss_elos, on=['Season', 'LTeamID'])\n", | |
"df_concat['Elo_Diff'] = df_concat['W_Elo'] - df_concat['L_Elo']\n", | |
"df_concat.head()" | |
] | |
}, | |
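For intuition about what `Elo_Diff` encodes: under the standard Elo formula, a rating gap maps directly to an expected win probability. The notebook fits its own logistic regression on `Elo_Diff` rather than hard-coding this curve, so this is only a reference sketch. The function name and the conventional scale of 400 are assumptions, not taken from the notebook's Elo code.

```python
def elo_win_probability(elo_diff, scale=400.0):
    """Expected win probability for the team whose rating is higher by elo_diff."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))


# Elo gaps copied from the first rows of the Elo_Diff table above
for diff in (-19.5577, 3.94023, 318.145):
    print(f"{diff:9.3f} -> {elo_win_probability(diff):.3f}")
```

A gap of zero gives exactly 0.5, and swapping the two teams gives the complementary probability, which is the same symmetry the mirrored training rows rely on.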
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>Elo_Diff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1116</td>\n", | |
" <td>1234</td>\n", | |
" <td>-19.5577</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1120</td>\n", | |
" <td>1345</td>\n", | |
" <td>-11.2464</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1207</td>\n", | |
" <td>1250</td>\n", | |
" <td>318.145</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1229</td>\n", | |
" <td>1425</td>\n", | |
" <td>3.94023</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>136</td>\n", | |
" <td>1242</td>\n", | |
" <td>1325</td>\n", | |
" <td>14.9841</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID Elo_Diff Result\n", | |
"0 1985 136 1116 1234 -19.5577 1\n", | |
"1 1985 136 1120 1345 -11.2464 1\n", | |
"2 1985 136 1207 1250 318.145 1\n", | |
"3 1985 136 1229 1425 3.94023 1\n", | |
"4 1985 136 1242 1325 14.9841 1" | |
] | |
}, | |
"execution_count": 50, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Prediction dataframe\n", | |
"df_wins = pd.DataFrame()\n", | |
"df_wins['Season'] = df_concat['Season']\n", | |
"df_wins['DayNum'] = df_concat['DayNum']\n", | |
"df_wins['WTeamID'] = df_concat['WTeamID']\n", | |
"df_wins['LTeamID'] = df_concat['LTeamID']\n", | |
"\n", | |
"df_wins['Elo_Diff'] = df_concat['Elo_Diff']\n", | |
"df_wins['Result'] = 1\n", | |
"\n", | |
"df_losses = pd.DataFrame()\n", | |
"df_losses['Season'] = df_concat['Season']\n", | |
"df_losses['DayNum'] = df_concat['DayNum']\n", | |
"df_losses['WTeamID'] = df_concat['WTeamID']\n", | |
"df_losses['LTeamID'] = df_concat['LTeamID']\n", | |
"\n", | |
"df_losses['Elo_Diff'] = -df_concat['Elo_Diff']\n", | |
"df_losses['Result'] = 0\n", | |
"\n", | |
"df_predictions = pd.concat((df_wins, df_losses))\n", | |
"df_predictions.head()" | |
] | |
}, | |
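The tournament results file lists the winner first, so using it directly would give every row the label 1. The cell above therefore adds each game a second time with the sign of the difference flipped and the label set to 0. A minimal pure-Python sketch of that mirroring follows; `mirror_games` is a hypothetical helper, not part of the notebook.

```python
def mirror_games(diffs):
    """Return (feature, label) pairs: each game once as a win and once, negated, as a loss."""
    wins = [(d, 1) for d in diffs]
    losses = [(-d, 0) for d in diffs]
    return wins + losses


rows = mirror_games([-19.5577, -11.2464, 318.145])
features = [f for f, _ in rows]
labels = [y for _, y in rows]

print(len(rows))                  # twice the number of games
print(sum(labels) / len(labels))  # share of positive labels
print(sum(features))              # mirrored features cancel in pairs
```

Because the two classes are perfectly balanced and the features are symmetric around zero, the fitted intercept should end up close to zero.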
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4158" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Remove play-in games\n", | |
"df_predictions = df_predictions.loc[df_predictions['DayNum'] > 135]\n", | |
"len(df_predictions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-0.54538007699370672" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Testing and training sets\n", | |
"df_train = df_predictions.loc[df_predictions['Season'] < 2014]\n", | |
"df_test = df_predictions.loc[df_predictions['Season'] >= 2014]\n", | |
"\n", | |
"X_train = df_train['Elo_Diff'].values.reshape(-1,1)\n", | |
"Y_train = df_train['Result'].values\n", | |
"\n", | |
"X_test = df_test['Elo_Diff'].values.reshape(-1,1)\n", | |
"Y_test = df_test['Result'].values\n", | |
"\n", | |
"logreg = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf.fit(X_train, Y_train)\n", | |
"clf.score(X_train, Y_train)" | |
] | |
}, | |
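With `scoring='neg_log_loss'`, `clf.score` returns the negated mean log loss, so the value above is a loss of roughly 0.545 on the training data. The sketch below computes log loss by hand to give that number some context. It is a plain-Python illustration with made-up labels and probabilities, not the scikit-learn implementation.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                      # illustrative labels only
baseline = log_loss(labels, [0.5] * 4)     # always predicting 50/50
informed = log_loss(labels, [0.8, 0.3, 0.6, 0.4])

print(baseline, informed)
```

Always predicting 0.5 scores ln(2), about 0.693, so any model worth submitting needs a log loss below that.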
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>Elo_Diff</th>\n", | |
" <th>Result</th>\n", | |
" <th>Elo_Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>2112</th>\n", | |
" <td>2017</td>\n", | |
" <td>146</td>\n", | |
" <td>1314</td>\n", | |
" <td>1246</td>\n", | |
" <td>17.9249</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2113</th>\n", | |
" <td>2017</td>\n", | |
" <td>146</td>\n", | |
" <td>1376</td>\n", | |
" <td>1196</td>\n", | |
" <td>144.711</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2114</th>\n", | |
" <td>2017</td>\n", | |
" <td>152</td>\n", | |
" <td>1211</td>\n", | |
" <td>1376</td>\n", | |
" <td>-242.598</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2115</th>\n", | |
" <td>2017</td>\n", | |
" <td>152</td>\n", | |
" <td>1314</td>\n", | |
" <td>1332</td>\n", | |
" <td>-45.0282</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2116</th>\n", | |
" <td>2017</td>\n", | |
" <td>154</td>\n", | |
" <td>1314</td>\n", | |
" <td>1211</td>\n", | |
" <td>-10.9314</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID Elo_Diff Result Elo_Pred\n", | |
"2112 2017 146 1314 1246 17.9249 0 1\n", | |
"2113 2017 146 1376 1196 144.711 0 1\n", | |
"2114 2017 152 1211 1376 -242.598 0 0\n", | |
"2115 2017 152 1314 1332 -45.0282 0 0\n", | |
"2116 2017 154 1314 1211 -10.9314 0 0" | |
] | |
}, | |
"execution_count": 53, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
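], | |
"source": [ | |
"# Predictions for the 2014-2017 test seasons\n", | |
"Y_pred = clf.predict(X_test)\n", | |
"df_test = df_test.assign(Elo_Pred=Y_pred)  # assign returns a new frame, avoiding SettingWithCopyWarning\n", | |
"df_test.tail()" | |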
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 54, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Confusion Matrix: \n", | |
"[[185 67]\n", | |
" [ 67 185]] \n", | |
"\n", | |
" precision recall f1-score support\n", | |
"\n", | |
" 0 0.73 0.73 0.73 252\n", | |
" 1 0.73 0.73 0.73 252\n", | |
"\n", | |
"avg / total 0.73 0.73 0.73 504\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# More results\n", | |
"print('Confusion Matrix: ')\n", | |
"print(confusion_matrix(Y_test, Y_pred), '\\n')\n", | |
"print(classification_report(Y_test, Y_pred))" | |
] | |
}, | |
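The headline numbers in the classification report can be recovered directly from the confusion matrix. scikit-learn lays the matrix out as `[[tn, fp], [fn, tp]]`, with true labels in rows and predicted labels in columns. The helper below is a hypothetical illustration, not part of the notebook.

```python
def binary_metrics(cm):
    """Accuracy, precision and recall for class 1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Confusion matrix printed above
acc, prec, rec = binary_metrics([[185, 67], [67, 185]])
print(round(acc, 3), round(prec, 3), round(rec, 3))
```

Each game appears twice in the test set with opposite labels, which is the likely reason the matrix is symmetric and all three metrics coincide.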
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 2. Average Select Ranking Systems\n", | |
"We pull and average the top-performing ranking systems from the earlier analysis:\n", | |
"1. SAG\n", | |
"2. WLK\n", | |
"3. POM\n", | |
"4. MOR" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Pull various system rankings\n", | |
"df_topranks = season_elos.loc[:, season_elos.columns != 'Elo']\n", | |
"df_topranks = df_topranks.rename(columns={'team_id':'Team_ID', 'season':'Season'}) \n", | |
"\n", | |
"df_temp = df_massey.loc[(df_massey['RankingDayNum'] == final_day) & (df_massey['SystemName'] == 'SAG')]\n", | |
"df_temp = df_temp.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
"df_temp.rename(columns={'OrdinalRank':'SAG', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
"df_temp2 = df_massey.loc[(df_massey['RankingDayNum'] == final_day) & (df_massey['SystemName'] == 'WLK')]\n", | |
"df_temp2 = df_temp2.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
"df_temp2.rename(columns={'OrdinalRank':'WLK', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
"df_temp3 = df_massey.loc[(df_massey['RankingDayNum'] == final_day) & (df_massey['SystemName'] == 'POM')]\n", | |
"df_temp3 = df_temp3.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
"df_temp3.rename(columns={'OrdinalRank':'POM', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
"df_temp4 = df_massey.loc[(df_massey['RankingDayNum'] == final_day) & (df_massey['SystemName'] == 'MOR')]\n", | |
"df_temp4 = df_temp4.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
"df_temp4.rename(columns={'OrdinalRank':'MOR', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
"df_topranks = pd.merge(left=df_topranks, right=df_temp, how='left', on=['Season', 'Team_ID'])\n", | |
"df_topranks = pd.merge(left=df_topranks, right=df_temp2, how='left', on=['Season', 'Team_ID'])\n", | |
"df_topranks = pd.merge(left=df_topranks, right=df_temp3, how='left', on=['Season', 'Team_ID'])\n", | |
"df_topranks = pd.merge(left=df_topranks, right=df_temp4, how='left', on=['Season', 'Team_ID'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>season_elo</th>\n", | |
" <th>Team_ID</th>\n", | |
" <th>SAG</th>\n", | |
" <th>WLK</th>\n", | |
" <th>POM</th>\n", | |
" <th>MOR</th>\n", | |
" <th>MeanRank</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014</td>\n", | |
" <td>1317.05</td>\n", | |
" <td>1101</td>\n", | |
" <td>346.0</td>\n", | |
" <td>330.0</td>\n", | |
" <td>348.0</td>\n", | |
" <td>349.0</td>\n", | |
" <td>343.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2015</td>\n", | |
" <td>1201.11</td>\n", | |
" <td>1101</td>\n", | |
" <td>336.0</td>\n", | |
" <td>332.0</td>\n", | |
" <td>332.0</td>\n", | |
" <td>346.0</td>\n", | |
" <td>336.50</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2016</td>\n", | |
" <td>1213.74</td>\n", | |
" <td>1101</td>\n", | |
" <td>320.0</td>\n", | |
" <td>304.0</td>\n", | |
" <td>318.0</td>\n", | |
" <td>311.0</td>\n", | |
" <td>313.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2017</td>\n", | |
" <td>1233.86</td>\n", | |
" <td>1101</td>\n", | |
" <td>305.0</td>\n", | |
" <td>307.0</td>\n", | |
" <td>300.0</td>\n", | |
" <td>317.0</td>\n", | |
" <td>307.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>2003</td>\n", | |
" <td>1452.53</td>\n", | |
" <td>1102</td>\n", | |
" <td>149.0</td>\n", | |
" <td>165.0</td>\n", | |
" <td>160.0</td>\n", | |
" <td>132.0</td>\n", | |
" <td>151.50</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season season_elo Team_ID SAG WLK POM MOR MeanRank\n", | |
"0 2014 1317.05 1101 346.0 330.0 348.0 349.0 343.25\n", | |
"1 2015 1201.11 1101 336.0 332.0 332.0 346.0 336.50\n", | |
"2 2016 1213.74 1101 320.0 304.0 318.0 311.0 313.25\n", | |
"3 2017 1233.86 1101 305.0 307.0 300.0 317.0 307.25\n", | |
"22 2003 1452.53 1102 149.0 165.0 160.0 132.0 151.50" | |
] | |
}, | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Mean of all four systems\n", | |
"df_topranks['MeanRank'] = (df_topranks['SAG'] + df_topranks['WLK'] + df_topranks['POM'] + df_topranks['MOR']) / 4\n", | |
"df_topranks.dropna(inplace = True)\n", | |
"df_topranks.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>season_elo_x</th>\n", | |
" <th>W_MeanRank</th>\n", | |
" <th>season_elo_y</th>\n", | |
" <th>L_MeanRank</th>\n", | |
" <th>MeanRank_Diff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>1318.06</td>\n", | |
" <td>259.50</td>\n", | |
" <td>1288.79</td>\n", | |
" <td>264.50</td>\n", | |
" <td>-5.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>2051.08</td>\n", | |
" <td>2.75</td>\n", | |
" <td>1442.8</td>\n", | |
" <td>160.50</td>\n", | |
" <td>-157.75</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>1787.95</td>\n", | |
" <td>30.00</td>\n", | |
" <td>1833.37</td>\n", | |
" <td>22.00</td>\n", | |
" <td>8.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>1663.71</td>\n", | |
" <td>45.00</td>\n", | |
" <td>1835.58</td>\n", | |
" <td>24.25</td>\n", | |
" <td>20.75</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>1862.13</td>\n", | |
" <td>39.00</td>\n", | |
" <td>1825.56</td>\n", | |
" <td>44.00</td>\n", | |
" <td>-5.00</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID season_elo_x W_MeanRank season_elo_y \\\n", | |
"0 2003 134 1421 1411 1318.06 259.50 1288.79 \n", | |
"1 2003 136 1112 1436 2051.08 2.75 1442.8 \n", | |
"2 2003 136 1113 1272 1787.95 30.00 1833.37 \n", | |
"3 2003 136 1141 1166 1663.71 45.00 1835.58 \n", | |
"4 2003 136 1143 1301 1862.13 39.00 1825.56 \n", | |
"\n", | |
" L_MeanRank MeanRank_Diff \n", | |
"0 264.50 -5.00 \n", | |
"1 160.50 -157.75 \n", | |
"2 22.00 8.00 \n", | |
"3 24.25 20.75 \n", | |
"4 44.00 -5.00 " | |
] | |
}, | |
"execution_count": 57, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Join with tournament dataframe\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['WLoc', 'NumOT', 'WScore', 'LScore'], inplace=True, axis=1)\n", | |
"df_topranks.drop(labels=['SAG', 'WLK', 'POM', 'MOR'], inplace=True, axis=1)\n", | |
"\n", | |
"df_win_elos = df_topranks.rename(columns={'Team_ID':'WTeamID', 'MeanRank':'W_MeanRank'})\n", | |
"df_loss_elos = df_topranks.rename(columns={'Team_ID':'LTeamID', 'MeanRank':'L_MeanRank'}) \n", | |
"df_dummy = pd.merge(left=df_tour, right=df_win_elos, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_loss_elos, on=['Season', 'LTeamID'])\n", | |
"df_concat['MeanRank_Diff'] = df_concat['W_MeanRank'] - df_concat['L_MeanRank']\n", | |
"df_concat.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>MeanRank_Diff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>-5.00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>-157.75</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>8.00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>20.75</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>-5.00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID MeanRank_Diff Result\n", | |
"0 2003 134 1421 1411 -5.00 1\n", | |
"1 2003 136 1112 1436 -157.75 1\n", | |
"2 2003 136 1113 1272 8.00 1\n", | |
"3 2003 136 1141 1166 20.75 1\n", | |
"4 2003 136 1143 1301 -5.00 1" | |
] | |
}, | |
"execution_count": 58, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Prediction dataframe\n", | |
"df_wins = pd.DataFrame()\n", | |
"df_wins['Season'] = df_concat['Season']\n", | |
"df_wins['DayNum'] = df_concat['DayNum']\n", | |
"df_wins['WTeamID'] = df_concat['WTeamID']\n", | |
"df_wins['LTeamID'] = df_concat['LTeamID']\n", | |
"\n", | |
"df_wins['MeanRank_Diff'] = df_concat['MeanRank_Diff']\n", | |
"df_wins['Result'] = 1\n", | |
"\n", | |
"df_losses = pd.DataFrame()\n", | |
"df_losses['Season'] = df_concat['Season']\n", | |
"df_losses['DayNum'] = df_concat['DayNum']\n", | |
"df_losses['WTeamID'] = df_concat['WTeamID']\n", | |
"df_losses['LTeamID'] = df_concat['LTeamID']\n", | |
"\n", | |
"df_losses['MeanRank_Diff'] = -df_concat['MeanRank_Diff']\n", | |
"df_losses['Result'] = 0\n", | |
"\n", | |
"df_predictions = pd.concat((df_wins, df_losses))\n", | |
"df_predictions.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1890" | |
] | |
}, | |
"execution_count": 59, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Remove play-in games\n", | |
"df_predictions = df_predictions.loc[df_predictions['DayNum'] > 135]\n", | |
"len(df_predictions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-0.5450159995753735" | |
] | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Testing and training sets\n", | |
"df_train = df_predictions.loc[df_predictions['Season'] < 2014]\n", | |
"df_test = df_predictions.loc[df_predictions['Season'] >= 2014]\n", | |
"\n", | |
"X_train = df_train['MeanRank_Diff'].values.reshape(-1,1)\n", | |
"Y_train = df_train['Result'].values\n", | |
"\n", | |
"X_test = df_test['MeanRank_Diff'].values.reshape(-1,1)\n", | |
"Y_test = df_test['Result'].values\n", | |
"\n", | |
"logreg = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf2 = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf2.fit(X_train, Y_train)\n", | |
"clf2.score(X_train, Y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 61, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Confusion Matrix: \n", | |
"[[185 67]\n", | |
" [ 67 185]] \n", | |
"\n", | |
" precision recall f1-score support\n", | |
"\n", | |
" 0 0.73 0.73 0.73 252\n", | |
" 1 0.73 0.73 0.73 252\n", | |
"\n", | |
"avg / total 0.73 0.73 0.73 504\n", | |
"\n" | |
] | |
} | |
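], | |
"source": [ | |
"# More results: Y_pred must be recomputed from clf2; reusing the Elo model's Y_pred reproduces the Elo confusion matrix exactly\n", | |
"Y_pred = clf2.predict(X_test)\n", | |
"print('Confusion Matrix: ')\n", | |
"print(confusion_matrix(Y_test, Y_pred), '\\n')\n", | |
"print(classification_report(Y_test, Y_pred))" | |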
] | |
}, | |
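{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 3. Composite Model\n", | |
"Min-max scale the Elo ratings and the mean rankings to [0, 1], inverting the rankings so that higher is better, then combine them in a weighted mean that serves as the single feature for logistic regression." | |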
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>Elo</th>\n", | |
" <th>Team_ID</th>\n", | |
" <th>season_elo</th>\n", | |
" <th>MeanRank</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014</td>\n", | |
" <td>1317.05</td>\n", | |
" <td>1101</td>\n", | |
" <td>1317.05</td>\n", | |
" <td>343.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2015</td>\n", | |
" <td>1201.11</td>\n", | |
" <td>1101</td>\n", | |
" <td>1201.11</td>\n", | |
" <td>336.50</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2016</td>\n", | |
" <td>1213.74</td>\n", | |
" <td>1101</td>\n", | |
" <td>1213.74</td>\n", | |
" <td>313.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2017</td>\n", | |
" <td>1233.86</td>\n", | |
" <td>1101</td>\n", | |
" <td>1233.86</td>\n", | |
" <td>307.25</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>2003</td>\n", | |
" <td>1452.53</td>\n", | |
" <td>1102</td>\n", | |
" <td>1452.53</td>\n", | |
" <td>151.50</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season Elo Team_ID season_elo MeanRank\n", | |
"0 2014 1317.05 1101 1317.05 343.25\n", | |
"1 2015 1201.11 1101 1201.11 336.50\n", | |
"2 2016 1213.74 1101 1213.74 313.25\n", | |
"3 2017 1233.86 1101 1233.86 307.25\n", | |
"22 2003 1452.53 1102 1452.53 151.50" | |
] | |
}, | |
"execution_count": 62, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Set up and drop null rows\n", | |
"season_elos = season_elos.rename(columns={'team_id':'Team_ID', 'season':'Season', 'season_elo':'Elo'}) \n", | |
"df = pd.merge(left=season_elos, right=df_topranks, how='left', on=['Season', 'Team_ID'])\n", | |
"df.dropna(inplace=True)\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Min-max scale both features to [0, 1]; ranks are inverted so that a higher value always means a stronger team\n", | |
"scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", | |
"df['Elo_Scaled'] = scaler.fit_transform(df['Elo'].values.reshape(-1,1)).ravel()\n", | |
"df['MeanRank_Scaled'] = 1 - scaler.fit_transform(df['MeanRank'].values.reshape(-1,1)).ravel()" | |
] | |
}, | |
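Min-max scaling maps each value to `(x - min) / (max - min)`, and subtracting the result from 1 flips the rank scale so that the best-ranked team scores highest. The sketch below reproduces that in plain Python. The bounds of 1 and 351 are assumptions inferred from the `MeanRank_Scaled` values shown in this section, and `min_max_scale` is a hypothetical helper rather than the scikit-learn scaler.

```python
def min_max_scale(values, lo=None, hi=None):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]


# MeanRank values from the table in this section; bounds are assumed, see above
mean_ranks = [343.25, 336.50, 151.50]
scaled = [1 - s for s in min_max_scale(mean_ranks, lo=1.0, hi=351.0)]
print([round(s, 6) for s in scaled])
```

Note that fitting the scaler on all seasons at once lets the test seasons influence the bounds; fitting on the training seasons only would be the stricter choice.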
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>Elo</th>\n", | |
" <th>Team_ID</th>\n", | |
" <th>season_elo</th>\n", | |
" <th>MeanRank</th>\n", | |
" <th>Elo_Scaled</th>\n", | |
" <th>MeanRank_Scaled</th>\n", | |
" <th>Composite Score</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014</td>\n", | |
" <td>1317.05</td>\n", | |
" <td>1101</td>\n", | |
" <td>1317.05</td>\n", | |
" <td>343.25</td>\n", | |
" <td>0.377452</td>\n", | |
" <td>0.022143</td>\n", | |
" <td>0.199798</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2015</td>\n", | |
" <td>1201.11</td>\n", | |
" <td>1101</td>\n", | |
" <td>1201.11</td>\n", | |
" <td>336.50</td>\n", | |
" <td>0.289849</td>\n", | |
" <td>0.041429</td>\n", | |
" <td>0.165639</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2016</td>\n", | |
" <td>1213.74</td>\n", | |
" <td>1101</td>\n", | |
" <td>1213.74</td>\n", | |
" <td>313.25</td>\n", | |
" <td>0.299388</td>\n", | |
" <td>0.107857</td>\n", | |
" <td>0.203622</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2017</td>\n", | |
" <td>1233.86</td>\n", | |
" <td>1101</td>\n", | |
" <td>1233.86</td>\n", | |
" <td>307.25</td>\n", | |
" <td>0.314596</td>\n", | |
" <td>0.125000</td>\n", | |
" <td>0.219798</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>2003</td>\n", | |
" <td>1452.53</td>\n", | |
" <td>1102</td>\n", | |
" <td>1452.53</td>\n", | |
" <td>151.50</td>\n", | |
" <td>0.479827</td>\n", | |
" <td>0.570000</td>\n", | |
" <td>0.524914</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season Elo Team_ID season_elo MeanRank Elo_Scaled \\\n", | |
"0 2014 1317.05 1101 1317.05 343.25 0.377452 \n", | |
"1 2015 1201.11 1101 1201.11 336.50 0.289849 \n", | |
"2 2016 1213.74 1101 1213.74 313.25 0.299388 \n", | |
"3 2017 1233.86 1101 1233.86 307.25 0.314596 \n", | |
"22 2003 1452.53 1102 1452.53 151.50 0.479827 \n", | |
"\n", | |
" MeanRank_Scaled Composite Score \n", | |
"0 0.022143 0.199798 \n", | |
"1 0.041429 0.165639 \n", | |
"2 0.107857 0.203622 \n", | |
"3 0.125000 0.219798 \n", | |
"22 0.570000 0.524914 " | |
] | |
}, | |
"execution_count": 64, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
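], | |
"source": [ | |
"# Model 1: weight the scaled mean rank twice as heavily as the scaled Elo\n", | |
"df['Composite Score'] = (df['Elo_Scaled'] + (2 * df['MeanRank_Scaled'])) / 3\n", | |
"df.head()\n", | |
"\n", | |
"# Model 2: equal weights (the Composite Score values saved with this cell match this variant)\n", | |
"#df['Composite Score'] = (df['Elo_Scaled'] + (df['MeanRank_Scaled'])) / 2\n", | |
"#df.head()" | |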
] | |
}, | |
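The two composite variants differ only in how much weight the scaled mean rank receives relative to the scaled Elo. A small sketch makes the difference concrete, using the scaled values from the first row of the table above; `composite` and `rank_weight` are hypothetical names introduced for illustration.

```python
def composite(elo_scaled, rank_scaled, rank_weight=1.0):
    """Weighted mean of two [0, 1] features; Elo always has weight 1."""
    return (elo_scaled + rank_weight * rank_scaled) / (1.0 + rank_weight)


# Scaled values from the first row of the table above
elo_scaled, rank_scaled = 0.377452, 0.022143

model_1 = composite(elo_scaled, rank_scaled, rank_weight=2.0)  # rank counted twice
model_2 = composite(elo_scaled, rank_scaled, rank_weight=1.0)  # equal weights
print(model_1, model_2)
```

Treating `rank_weight` as a tunable parameter would allow choosing it by cross-validated log loss instead of comparing two hand-picked settings.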
{ | |
"cell_type": "code", | |
"execution_count": 65, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>W_Composite</th>\n", | |
" <th>L_Composite</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>0.319824</td>\n", | |
" <td>0.301622</td>\n", | |
" <td>0.018201</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>0.963552</td>\n", | |
" <td>0.508381</td>\n", | |
" <td>0.455171</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>0.825212</td>\n", | |
" <td>0.853798</td>\n", | |
" <td>-0.028586</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>0.756841</td>\n", | |
" <td>0.851419</td>\n", | |
" <td>-0.094578</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>0.840379</td>\n", | |
" <td>0.819419</td>\n", | |
" <td>0.020960</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID W_Composite L_Composite Composite_Diff\n", | |
"0 2003 134 1421 1411 0.319824 0.301622 0.018201\n", | |
"1 2003 136 1112 1436 0.963552 0.508381 0.455171\n", | |
"2 2003 136 1113 1272 0.825212 0.853798 -0.028586\n", | |
"3 2003 136 1141 1166 0.756841 0.851419 -0.094578\n", | |
"4 2003 136 1143 1301 0.840379 0.819419 0.020960" | |
] | |
}, | |
"execution_count": 65, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Join tournament dataframe\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['WLoc', 'NumOT', 'WScore', 'LScore'], inplace=True, axis=1)\n", | |
"df.drop(labels=['Elo', 'season_elo', 'MeanRank'], inplace=True, axis=1)\n", | |
"\n", | |
"df_win_elos = df.rename(columns={'Team_ID':'WTeamID', 'Composite Score':'W_Composite'})\n", | |
"df_loss_elos = df.rename(columns={'Team_ID':'LTeamID', 'Composite Score':'L_Composite'}) \n", | |
"df_dummy = pd.merge(left=df_tour, right=df_win_elos, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_loss_elos, on=['Season', 'LTeamID'])\n", | |
"df_concat['Composite_Diff'] = df_concat['W_Composite'] - df_concat['L_Composite']\n", | |
"df_total = df_concat[['Season', 'DayNum', 'WTeamID', 'LTeamID', 'W_Composite', 'L_Composite', 'Composite_Diff']]\n", | |
"df_total.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>0.018201</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>0.455171</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>-0.028586</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>-0.094578</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>0.020960</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID Composite_Diff Result\n", | |
"0 2003 134 1421 1411 0.018201 1\n", | |
"1 2003 136 1112 1436 0.455171 1\n", | |
"2 2003 136 1113 1272 -0.028586 1\n", | |
"3 2003 136 1141 1166 -0.094578 1\n", | |
"4 2003 136 1143 1301 0.020960 1" | |
] | |
}, | |
"execution_count": 66, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Prediction dataframe\n", | |
"df_wins = pd.DataFrame()\n", | |
"df_wins['Season'] = df_concat['Season']\n", | |
"df_wins['DayNum'] = df_concat['DayNum']\n", | |
"df_wins['WTeamID'] = df_concat['WTeamID']\n", | |
"df_wins['LTeamID'] = df_concat['LTeamID']\n", | |
"\n", | |
"df_wins['Composite_Diff'] = df_concat['Composite_Diff']\n", | |
"df_wins['Result'] = 1\n", | |
"\n", | |
"df_losses = pd.DataFrame()\n", | |
"df_losses['Season'] = df_concat['Season']\n", | |
"df_losses['DayNum'] = df_concat['DayNum']\n", | |
"df_losses['WTeamID'] = df_concat['WTeamID']\n", | |
"df_losses['LTeamID'] = df_concat['LTeamID']\n", | |
"\n", | |
"df_losses['Composite_Diff'] = -df_concat['Composite_Diff']\n", | |
"df_losses['Result'] = 0\n", | |
"\n", | |
"df_predictions = pd.concat((df_wins, df_losses))\n", | |
"df_predictions.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 67, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1890" | |
] | |
}, | |
"execution_count": 67, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Remove play-in games\n", | |
"df_predictions = df_predictions.loc[df_predictions['DayNum'] > 135]\n", | |
"len(df_predictions)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 68, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"GridSearchCV(cv=None, error_score='raise',\n", | |
" estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", | |
" intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", | |
" penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", | |
" verbose=0, warm_start=False),\n", | |
" fit_params=None, iid=True, n_jobs=1,\n", | |
" param_grid={'C': array([ 1.00000e-05, 1.29155e-04, 1.66810e-03, 2.15443e-02,\n", | |
" 2.78256e-01, 3.59381e+00, 4.64159e+01, 5.99484e+02,\n", | |
" 7.74264e+03, 1.00000e+05])},\n", | |
" pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", | |
" scoring='neg_log_loss', verbose=0)" | |
] | |
}, | |
"execution_count": 68, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Testing and training sets\n", | |
"df_train = df_predictions.loc[df_predictions['Season'] < 2014]\n", | |
"df_test = df_predictions.loc[df_predictions['Season'] >= 2014]\n", | |
"\n", | |
"X_train = df_train['Composite_Diff'].values.reshape(-1,1)\n", | |
"Y_train = df_train['Result'].values\n", | |
"\n", | |
"X_test = df_test['Composite_Diff'].values.reshape(-1,1)\n", | |
"Y_test = df_test['Result'].values\n", | |
"\n", | |
"logreg = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf3 = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf3" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 69, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-0.5427760668455921" | |
] | |
}, | |
"execution_count": 69, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Training score\n", | |
"clf3.fit(X_train, Y_train)\n", | |
"clf3.score(X_train, Y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 70, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Save model\n", | |
"filename = 'ncaa_tourney1.pkl'\n", | |
"#filename = 'ncaa_tourney2.pkl'\n", | |
"pickle.dump(clf3, open(filename, 'wb'))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 71, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Confusion Matrix: \n", | |
"[[185 67]\n", | |
" [ 67 185]] \n", | |
"\n", | |
" precision recall f1-score support\n", | |
"\n", | |
" 0 0.73 0.73 0.73 252\n", | |
" 1 0.73 0.73 0.73 252\n", | |
"\n", | |
"avg / total 0.73 0.73 0.73 504\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# More results\n", | |
"print('Confusion Matrix: ')\n", | |
"print(confusion_matrix(Y_test, Y_pred), '\\n')\n", | |
"print(classification_report(Y_test, Y_pred))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Model Performance" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 72, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-0.51075848153406123" | |
] | |
}, | |
"execution_count": 72, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# 2014-2017 log loss\n", | |
"Y_pred = clf3.predict(X_test)\n", | |
"df_test['Pred'] = Y_pred\n", | |
"clf3.score(X_test, Y_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 73, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" <th>Result</th>\n", | |
" <th>Pred</th>\n", | |
" <th>Prob</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>717</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>1163</td>\n", | |
" <td>1386</td>\n", | |
" <td>0.091565</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0.683972</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>718</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>1173</td>\n", | |
" <td>1326</td>\n", | |
" <td>-0.137862</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0.761780</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>719</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>1196</td>\n", | |
" <td>1107</td>\n", | |
" <td>0.483541</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0.983329</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>720</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>1217</td>\n", | |
" <td>1153</td>\n", | |
" <td>-0.080948</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0.664310</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>721</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>1257</td>\n", | |
" <td>1264</td>\n", | |
" <td>0.271877</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0.908253</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID Composite_Diff Result Pred Prob\n", | |
"717 2014 136 1163 1386 0.091565 1 1 0.683972\n", | |
"718 2014 136 1173 1326 -0.137862 1 0 0.761780\n", | |
"719 2014 136 1196 1107 0.483541 1 1 0.983329\n", | |
"720 2014 136 1217 1153 -0.080948 1 0 0.664310\n", | |
"721 2014 136 1257 1264 0.271877 1 1 0.908253" | |
] | |
}, | |
"execution_count": 73, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Probabilities\n", | |
"probs = clf3.predict_proba(X_test)\n", | |
"Y_prob = [max(item[0],item[1]) for item in probs]\n", | |
"df_test['Prob'] = Y_prob\n", | |
"\n", | |
"df_test.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 74, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamName</th>\n", | |
" <th>LTeamName</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" <th>Prob</th>\n", | |
" <th>Pred</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>252</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>Connecticut</td>\n", | |
" <td>St Joseph's PA</td>\n", | |
" <td>-0.091565</td>\n", | |
" <td>0.683972</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>253</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>Dayton</td>\n", | |
" <td>Ohio St</td>\n", | |
" <td>0.137862</td>\n", | |
" <td>0.761780</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>254</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>Florida</td>\n", | |
" <td>Albany NY</td>\n", | |
" <td>-0.483541</td>\n", | |
" <td>0.983329</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>255</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>Harvard</td>\n", | |
" <td>Cincinnati</td>\n", | |
" <td>0.080948</td>\n", | |
" <td>0.664310</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>256</th>\n", | |
" <td>2014</td>\n", | |
" <td>136</td>\n", | |
" <td>Louisville</td>\n", | |
" <td>Manhattan</td>\n", | |
" <td>-0.271877</td>\n", | |
" <td>0.908253</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamName LTeamName Composite_Diff Prob \\\n", | |
"252 2014 136 Connecticut St Joseph's PA -0.091565 0.683972 \n", | |
"253 2014 136 Dayton Ohio St 0.137862 0.761780 \n", | |
"254 2014 136 Florida Albany NY -0.483541 0.983329 \n", | |
"255 2014 136 Harvard Cincinnati 0.080948 0.664310 \n", | |
"256 2014 136 Louisville Manhattan -0.271877 0.908253 \n", | |
"\n", | |
" Pred Result \n", | |
"252 0 0 \n", | |
"253 1 0 \n", | |
"254 0 0 \n", | |
"255 1 0 \n", | |
"256 0 0 " | |
] | |
}, | |
"execution_count": 74, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Teams dataframe\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"teams = pd.read_csv(data_dir + 'teams.csv')\n", | |
"teams.head()\n", | |
"\n", | |
"df_dummy = teams.rename(columns={'TeamID':'WTeamID'})\n", | |
"df_results = pd.merge(left=df_test, right=df_dummy, how='left', on=['WTeamID'])\n", | |
"\n", | |
"df_dummy = teams.rename(columns={'TeamID':'LTeamID'})\n", | |
"df_results = pd.merge(left=df_results, right=df_dummy, how='left', on=['LTeamID'])\n", | |
"\n", | |
"df_results = df_results.rename(columns={'TeamName_x':'WTeamName', 'TeamName_y':'LTeamName'})\n", | |
"df_results = df_results[['Season', 'DayNum', 'WTeamName', 'LTeamName', 'Composite_Diff', 'Prob', 'Pred', 'Result']]\n", | |
"df_results.drop_duplicates(subset=['Season','DayNum','WTeamName'], keep='last', inplace=True)\n", | |
"df_results.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 75, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamName</th>\n", | |
" <th>LTeamName</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" <th>Prob</th>\n", | |
" <th>Pred</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>499</th>\n", | |
" <td>2017</td>\n", | |
" <td>146</td>\n", | |
" <td>North Carolina</td>\n", | |
" <td>Kentucky</td>\n", | |
" <td>0.007129</td>\n", | |
" <td>0.515024</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>500</th>\n", | |
" <td>2017</td>\n", | |
" <td>146</td>\n", | |
" <td>South Carolina</td>\n", | |
" <td>Florida</td>\n", | |
" <td>0.091816</td>\n", | |
" <td>0.684429</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>498</th>\n", | |
" <td>2017</td>\n", | |
" <td>145</td>\n", | |
" <td>Oregon</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.041307</td>\n", | |
" <td>0.586205</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>496</th>\n", | |
" <td>2017</td>\n", | |
" <td>144</td>\n", | |
" <td>South Carolina</td>\n", | |
" <td>Baylor</td>\n", | |
" <td>0.089609</td>\n", | |
" <td>0.680396</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>492</th>\n", | |
" <td>2017</td>\n", | |
" <td>143</td>\n", | |
" <td>Xavier</td>\n", | |
" <td>Arizona</td>\n", | |
" <td>0.098212</td>\n", | |
" <td>0.695959</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>484</th>\n", | |
" <td>2017</td>\n", | |
" <td>139</td>\n", | |
" <td>Michigan</td>\n", | |
" <td>Louisville</td>\n", | |
" <td>0.057800</td>\n", | |
" <td>0.619488</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>487</th>\n", | |
" <td>2017</td>\n", | |
" <td>139</td>\n", | |
" <td>South Carolina</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.121985</td>\n", | |
" <td>0.736642</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>479</th>\n", | |
" <td>2017</td>\n", | |
" <td>138</td>\n", | |
" <td>Wisconsin</td>\n", | |
" <td>Villanova</td>\n", | |
" <td>0.091102</td>\n", | |
" <td>0.683126</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>475</th>\n", | |
" <td>2017</td>\n", | |
" <td>138</td>\n", | |
" <td>Florida</td>\n", | |
" <td>Virginia</td>\n", | |
" <td>0.024714</td>\n", | |
" <td>0.551910</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>480</th>\n", | |
" <td>2017</td>\n", | |
" <td>138</td>\n", | |
" <td>Xavier</td>\n", | |
" <td>Florida St</td>\n", | |
" <td>0.031929</td>\n", | |
" <td>0.566904</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>457</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Arkansas</td>\n", | |
" <td>Seton Hall</td>\n", | |
" <td>0.001082</td>\n", | |
" <td>0.502280</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>465</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Michigan St</td>\n", | |
" <td>Miami FL</td>\n", | |
" <td>0.008130</td>\n", | |
" <td>0.517131</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>468</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Rhode Island</td>\n", | |
" <td>Creighton</td>\n", | |
" <td>0.044419</td>\n", | |
" <td>0.592556</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>471</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>USC</td>\n", | |
" <td>SMU</td>\n", | |
" <td>0.135217</td>\n", | |
" <td>0.757709</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>469</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>South Carolina</td>\n", | |
" <td>Marquette</td>\n", | |
" <td>0.008009</td>\n", | |
" <td>0.516877</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>447</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>MTSU</td>\n", | |
" <td>Minnesota</td>\n", | |
" <td>0.028861</td>\n", | |
" <td>0.560540</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>448</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Northwestern</td>\n", | |
" <td>Vanderbilt</td>\n", | |
" <td>0.019119</td>\n", | |
" <td>0.540216</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamName LTeamName Composite_Diff Prob \\\n", | |
"499 2017 146 North Carolina Kentucky 0.007129 0.515024 \n", | |
"500 2017 146 South Carolina Florida 0.091816 0.684429 \n", | |
"498 2017 145 Oregon Kansas 0.041307 0.586205 \n", | |
"496 2017 144 South Carolina Baylor 0.089609 0.680396 \n", | |
"492 2017 143 Xavier Arizona 0.098212 0.695959 \n", | |
"484 2017 139 Michigan Louisville 0.057800 0.619488 \n", | |
"487 2017 139 South Carolina Duke 0.121985 0.736642 \n", | |
"479 2017 138 Wisconsin Villanova 0.091102 0.683126 \n", | |
"475 2017 138 Florida Virginia 0.024714 0.551910 \n", | |
"480 2017 138 Xavier Florida St 0.031929 0.566904 \n", | |
"457 2017 137 Arkansas Seton Hall 0.001082 0.502280 \n", | |
"465 2017 137 Michigan St Miami FL 0.008130 0.517131 \n", | |
"468 2017 137 Rhode Island Creighton 0.044419 0.592556 \n", | |
"471 2017 137 USC SMU 0.135217 0.757709 \n", | |
"469 2017 137 South Carolina Marquette 0.008009 0.516877 \n", | |
"447 2017 136 MTSU Minnesota 0.028861 0.560540 \n", | |
"448 2017 136 Northwestern Vanderbilt 0.019119 0.540216 \n", | |
"\n", | |
" Pred Result \n", | |
"499 1 0 \n", | |
"500 1 0 \n", | |
"498 1 0 \n", | |
"496 1 0 \n", | |
"492 1 0 \n", | |
"484 1 0 \n", | |
"487 1 0 \n", | |
"479 1 0 \n", | |
"475 1 0 \n", | |
"480 1 0 \n", | |
"457 1 0 \n", | |
"465 1 0 \n", | |
"468 1 0 \n", | |
"471 1 0 \n", | |
"469 1 0 \n", | |
"447 1 0 \n", | |
"448 1 0 " | |
] | |
}, | |
"execution_count": 75, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Wrong answers\n", | |
"incorrect = df_results.loc[df_results['Pred'] != df_results['Result']]\n", | |
"incorrect.sort_values(by='DayNum', ascending=False, inplace=True)\n", | |
"def get_incorrect_year(year):\n", | |
" incorrect_year = incorrect.loc[incorrect['Season'] == year]\n", | |
" return(incorrect_year)\n", | |
"get_incorrect_year(2017)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 76, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamName</th>\n", | |
" <th>LTeamName</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" <th>Prob</th>\n", | |
" <th>Pred</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>452</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Villanova</td>\n", | |
" <td>Mt St Mary's</td>\n", | |
" <td>-0.553039</td>\n", | |
" <td>0.990653</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>466</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>North Carolina</td>\n", | |
" <td>TX Southern</td>\n", | |
" <td>-0.515507</td>\n", | |
" <td>0.987217</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>461</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>UC Davis</td>\n", | |
" <td>-0.513945</td>\n", | |
" <td>0.987050</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>463</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Louisville</td>\n", | |
" <td>Jacksonville St</td>\n", | |
" <td>-0.484377</td>\n", | |
" <td>0.983444</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>445</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Gonzaga</td>\n", | |
" <td>S Dakota St</td>\n", | |
" <td>-0.424736</td>\n", | |
" <td>0.972917</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>462</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Kentucky</td>\n", | |
" <td>N Kentucky</td>\n", | |
" <td>-0.419841</td>\n", | |
" <td>0.971808</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>441</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Arizona</td>\n", | |
" <td>North Dakota</td>\n", | |
" <td>-0.411020</td>\n", | |
" <td>0.969697</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>460</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Duke</td>\n", | |
" <td>Troy</td>\n", | |
" <td>-0.409618</td>\n", | |
" <td>0.969348</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>470</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>UCLA</td>\n", | |
" <td>Kent</td>\n", | |
" <td>-0.302583</td>\n", | |
" <td>0.927667</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>467</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Oregon</td>\n", | |
" <td>Iona</td>\n", | |
" <td>-0.295587</td>\n", | |
" <td>0.923608</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>442</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Butler</td>\n", | |
" <td>Winthrop</td>\n", | |
" <td>-0.262592</td>\n", | |
" <td>0.901517</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>454</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>West Virginia</td>\n", | |
" <td>Bucknell</td>\n", | |
" <td>-0.251144</td>\n", | |
" <td>0.892609</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>444</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Florida St</td>\n", | |
" <td>FL Gulf Coast</td>\n", | |
" <td>-0.247032</td>\n", | |
" <td>0.889239</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>458</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Baylor</td>\n", | |
" <td>New Mexico St</td>\n", | |
" <td>-0.246920</td>\n", | |
" <td>0.889146</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>443</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Florida</td>\n", | |
" <td>ETSU</td>\n", | |
" <td>-0.222845</td>\n", | |
" <td>0.867502</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>453</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Virginia</td>\n", | |
" <td>UNC Wilmington</td>\n", | |
" <td>-0.197225</td>\n", | |
" <td>0.840642</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>450</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Purdue</td>\n", | |
" <td>Vermont</td>\n", | |
" <td>-0.188957</td>\n", | |
" <td>0.831080</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>449</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Notre Dame</td>\n", | |
" <td>Princeton</td>\n", | |
" <td>-0.139029</td>\n", | |
" <td>0.763562</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>471</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>USC</td>\n", | |
" <td>SMU</td>\n", | |
" <td>0.135217</td>\n", | |
" <td>0.757709</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>446</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Iowa St</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>-0.122716</td>\n", | |
" <td>0.737836</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>455</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Wisconsin</td>\n", | |
" <td>Virginia Tech</td>\n", | |
" <td>-0.099396</td>\n", | |
" <td>0.698068</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>472</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Wichita St</td>\n", | |
" <td>Dayton</td>\n", | |
" <td>-0.091455</td>\n", | |
" <td>0.683771</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>459</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Cincinnati</td>\n", | |
" <td>Kansas St</td>\n", | |
" <td>-0.064799</td>\n", | |
" <td>0.633297</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>451</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>St Mary's CA</td>\n", | |
" <td>VA Commonwealth</td>\n", | |
" <td>-0.054220</td>\n", | |
" <td>0.612347</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>468</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Rhode Island</td>\n", | |
" <td>Creighton</td>\n", | |
" <td>0.044419</td>\n", | |
" <td>0.592556</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>464</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Michigan</td>\n", | |
" <td>Oklahoma St</td>\n", | |
" <td>-0.035108</td>\n", | |
" <td>0.573473</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>447</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>MTSU</td>\n", | |
" <td>Minnesota</td>\n", | |
" <td>0.028861</td>\n", | |
" <td>0.560540</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>448</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Northwestern</td>\n", | |
" <td>Vanderbilt</td>\n", | |
" <td>0.019119</td>\n", | |
" <td>0.540216</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>465</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Michigan St</td>\n", | |
" <td>Miami FL</td>\n", | |
" <td>0.008130</td>\n", | |
" <td>0.517131</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>469</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>South Carolina</td>\n", | |
" <td>Marquette</td>\n", | |
" <td>0.008009</td>\n", | |
" <td>0.516877</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>456</th>\n", | |
" <td>2017</td>\n", | |
" <td>136</td>\n", | |
" <td>Xavier</td>\n", | |
" <td>Maryland</td>\n", | |
" <td>-0.006985</td>\n", | |
" <td>0.514721</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>457</th>\n", | |
" <td>2017</td>\n", | |
" <td>137</td>\n", | |
" <td>Arkansas</td>\n", | |
" <td>Seton Hall</td>\n", | |
" <td>0.001082</td>\n", | |
" <td>0.502280</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamName LTeamName Composite_Diff \\\n", | |
"452 2017 136 Villanova Mt St Mary's -0.553039 \n", | |
"466 2017 137 North Carolina TX Southern -0.515507 \n", | |
"461 2017 137 Kansas UC Davis -0.513945 \n", | |
"463 2017 137 Louisville Jacksonville St -0.484377 \n", | |
"445 2017 136 Gonzaga S Dakota St -0.424736 \n", | |
"462 2017 137 Kentucky N Kentucky -0.419841 \n", | |
"441 2017 136 Arizona North Dakota -0.411020 \n", | |
"460 2017 137 Duke Troy -0.409618 \n", | |
"470 2017 137 UCLA Kent -0.302583 \n", | |
"467 2017 137 Oregon Iona -0.295587 \n", | |
"442 2017 136 Butler Winthrop -0.262592 \n", | |
"454 2017 136 West Virginia Bucknell -0.251144 \n", | |
"444 2017 136 Florida St FL Gulf Coast -0.247032 \n", | |
"458 2017 137 Baylor New Mexico St -0.246920 \n", | |
"443 2017 136 Florida ETSU -0.222845 \n", | |
"453 2017 136 Virginia UNC Wilmington -0.197225 \n", | |
"450 2017 136 Purdue Vermont -0.188957 \n", | |
"449 2017 136 Notre Dame Princeton -0.139029 \n", | |
"471 2017 137 USC SMU 0.135217 \n", | |
"446 2017 136 Iowa St Nevada -0.122716 \n", | |
"455 2017 136 Wisconsin Virginia Tech -0.099396 \n", | |
"472 2017 137 Wichita St Dayton -0.091455 \n", | |
"459 2017 137 Cincinnati Kansas St -0.064799 \n", | |
"451 2017 136 St Mary's CA VA Commonwealth -0.054220 \n", | |
"468 2017 137 Rhode Island Creighton 0.044419 \n", | |
"464 2017 137 Michigan Oklahoma St -0.035108 \n", | |
"447 2017 136 MTSU Minnesota 0.028861 \n", | |
"448 2017 136 Northwestern Vanderbilt 0.019119 \n", | |
"465 2017 137 Michigan St Miami FL 0.008130 \n", | |
"469 2017 137 South Carolina Marquette 0.008009 \n", | |
"456 2017 136 Xavier Maryland -0.006985 \n", | |
"457 2017 137 Arkansas Seton Hall 0.001082 \n", | |
"\n", | |
" Prob Pred Result \n", | |
"452 0.990653 0 0 \n", | |
"466 0.987217 0 0 \n", | |
"461 0.987050 0 0 \n", | |
"463 0.983444 0 0 \n", | |
"445 0.972917 0 0 \n", | |
"462 0.971808 0 0 \n", | |
"441 0.969697 0 0 \n", | |
"460 0.969348 0 0 \n", | |
"470 0.927667 0 0 \n", | |
"467 0.923608 0 0 \n", | |
"442 0.901517 0 0 \n", | |
"454 0.892609 0 0 \n", | |
"444 0.889239 0 0 \n", | |
"458 0.889146 0 0 \n", | |
"443 0.867502 0 0 \n", | |
"453 0.840642 0 0 \n", | |
"450 0.831080 0 0 \n", | |
"449 0.763562 0 0 \n", | |
"471 0.757709 1 0 \n", | |
"446 0.737836 0 0 \n", | |
"455 0.698068 0 0 \n", | |
"472 0.683771 0 0 \n", | |
"459 0.633297 0 0 \n", | |
"451 0.612347 0 0 \n", | |
"468 0.592556 1 0 \n", | |
"464 0.573473 0 0 \n", | |
"447 0.560540 1 0 \n", | |
"448 0.540216 1 0 \n", | |
"465 0.517131 1 0 \n", | |
"469 0.516877 1 0 \n", | |
"456 0.514721 0 0 \n", | |
"457 0.502280 1 0 " | |
] | |
}, | |
"execution_count": 76, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# First round 2017\n", | |
"def get_firstround_year(year):\n", | |
" first_round = df_results.loc[(df_results['DayNum'] <= 137) & (df_results['Season'] == year)] \n", | |
" first_round.sort_values(by='Prob', ascending=False, inplace=True)\n", | |
" return(first_round)\n", | |
"get_firstround_year(2017)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 77, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>TeamName</th>\n", | |
" <th>Composite Score</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>2964</th>\n", | |
" <td>2009</td>\n", | |
" <td>North Carolina</td>\n", | |
" <td>0.999286</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2963</th>\n", | |
" <td>2008</td>\n", | |
" <td>North Carolina</td>\n", | |
" <td>0.995707</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1100</th>\n", | |
" <td>2006</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.993241</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1966</th>\n", | |
" <td>2011</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.990613</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4746</th>\n", | |
" <td>2017</td>\n", | |
" <td>Villanova</td>\n", | |
" <td>0.987834</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1098</th>\n", | |
" <td>2004</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.987728</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1097</th>\n", | |
" <td>2003</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.986729</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2027</th>\n", | |
" <td>2015</td>\n", | |
" <td>Kentucky</td>\n", | |
" <td>0.985544</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1965</th>\n", | |
" <td>2010</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.983324</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1105</th>\n", | |
" <td>2011</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.982254</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1099</th>\n", | |
" <td>2005</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.981524</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1963</th>\n", | |
" <td>2008</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.978791</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1761</th>\n", | |
" <td>2005</td>\n", | |
" <td>Illinois</td>\n", | |
" <td>0.977183</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1967</th>\n", | |
" <td>2012</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.976465</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1104</th>\n", | |
" <td>2010</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.975412</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2962</th>\n", | |
" <td>2007</td>\n", | |
" <td>North Carolina</td>\n", | |
" <td>0.975156</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2024</th>\n", | |
" <td>2012</td>\n", | |
" <td>Kentucky</td>\n", | |
" <td>0.975007</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1322</th>\n", | |
" <td>2014</td>\n", | |
" <td>Florida</td>\n", | |
" <td>0.974509</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1971</th>\n", | |
" <td>2016</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.972754</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2189</th>\n", | |
" <td>2014</td>\n", | |
" <td>Louisville</td>\n", | |
" <td>0.972702</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season TeamName Composite Score\n", | |
"2964 2009 North Carolina 0.999286\n", | |
"2963 2008 North Carolina 0.995707\n", | |
"1100 2006 Duke 0.993241\n", | |
"1966 2011 Kansas 0.990613\n", | |
"4746 2017 Villanova 0.987834\n", | |
"1098 2004 Duke 0.987728\n", | |
"1097 2003 Duke 0.986729\n", | |
"2027 2015 Kentucky 0.985544\n", | |
"1965 2010 Kansas 0.983324\n", | |
"1105 2011 Duke 0.982254\n", | |
"1099 2005 Duke 0.981524\n", | |
"1963 2008 Kansas 0.978791\n", | |
"1761 2005 Illinois 0.977183\n", | |
"1967 2012 Kansas 0.976465\n", | |
"1104 2010 Duke 0.975412\n", | |
"2962 2007 North Carolina 0.975156\n", | |
"2024 2012 Kentucky 0.975007\n", | |
"1322 2014 Florida 0.974509\n", | |
"1971 2016 Kansas 0.972754\n", | |
"2189 2014 Louisville 0.972702" | |
] | |
}, | |
"execution_count": 77, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Best composite scoring teams ever\n", | |
"df_dummy = teams.rename(columns={'TeamID':'Team_ID'})\n", | |
"df_scores = pd.merge(left=df, right=df_dummy, how='left', on=['Team_ID'])\n", | |
"df_scores = df_scores[['Season', 'TeamName', 'Composite Score']]\n", | |
"df_scores.sort_values(by='Composite Score', ascending=False, inplace=True)\n", | |
"df_scores.head(20)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Stage 1 Submission" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 78, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014_1107_1110</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2014_1107_1112</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2014_1107_1113</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2014_1107_1124</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2014_1107_1140</td>\n", | |
" <td>0.5</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Pred\n", | |
"0 2014_1107_1110 0.5\n", | |
"1 2014_1107_1112 0.5\n", | |
"2 2014_1107_1113 0.5\n", | |
"3 2014_1107_1124 0.5\n", | |
"4 2014_1107_1140 0.5" | |
] | |
}, | |
"execution_count": 78, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Sample submission dataframe\n", | |
"data_dir = './March Madness 2018/'\n", | |
"sample = pd.read_csv(data_dir + 'SampleSubmissionStage1.csv')\n", | |
"sample.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 79, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Pred</th>\n", | |
" <th>Season</th>\n", | |
" <th>Team_ID_Low</th>\n", | |
" <th>Team_ID_High</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014_1107_1110</td>\n", | |
" <td>0.5</td>\n", | |
" <td>2014</td>\n", | |
" <td>1107</td>\n", | |
" <td>1110</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2014_1107_1112</td>\n", | |
" <td>0.5</td>\n", | |
" <td>2014</td>\n", | |
" <td>1107</td>\n", | |
" <td>1112</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2014_1107_1113</td>\n", | |
" <td>0.5</td>\n", | |
" <td>2014</td>\n", | |
" <td>1107</td>\n", | |
" <td>1113</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2014_1107_1124</td>\n", | |
" <td>0.5</td>\n", | |
" <td>2014</td>\n", | |
" <td>1107</td>\n", | |
" <td>1124</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2014_1107_1140</td>\n", | |
" <td>0.5</td>\n", | |
" <td>2014</td>\n", | |
" <td>1107</td>\n", | |
" <td>1140</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Pred Season Team_ID_Low Team_ID_High\n", | |
"0 2014_1107_1110 0.5 2014 1107 1110\n", | |
"1 2014_1107_1112 0.5 2014 1107 1112\n", | |
"2 2014_1107_1113 0.5 2014 1107 1113\n", | |
"3 2014_1107_1124 0.5 2014 1107 1124\n", | |
"4 2014_1107_1140 0.5 2014 1107 1140" | |
] | |
}, | |
"execution_count": 79, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Pull relevant information from ID\n", | |
"sample['Season'] = sample.apply(lambda row: row['ID'][0:4], axis=1)\n", | |
"sample['Team_ID_Low'] = sample.apply(lambda row: row['ID'][5:9], axis=1)\n", | |
"sample['Team_ID_High'] = sample.apply(lambda row: row['ID'][10:14], axis=1)\n", | |
"sample.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 80, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Merge composite scores\n", | |
"df['Season'] = df['Season'].astype(str)\n", | |
"df['Team_ID'] = df['Team_ID'].astype(str)\n", | |
"\n", | |
"df_lows = df.rename(columns={'Composite Score':'Score', 'Team_ID':'Team_ID_Low'})\n", | |
"df_highs = df.rename(columns={'Composite Score':'Score', 'Team_ID':'Team_ID_High'})\n", | |
"\n", | |
"df_dummy = pd.merge(left=sample, right=df_lows, how='left', on=['Season', 'Team_ID_Low'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_highs, on=['Season', 'Team_ID_High'])\n", | |
"df_sample = df_concat.rename(columns={'Score_x':'Score_Low', 'Score_y':'Score_High'})\n", | |
"df_sample['Score_Diff'] = df_sample['Score_Low'] - df_sample['Score_High']\n", | |
"df_full = df_sample\n", | |
"df_sample = df_sample[['ID', 'Score_Low', 'Score_High', 'Score_Diff','Pred']]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 81, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Score_Low</th>\n", | |
" <th>Score_High</th>\n", | |
" <th>Score_Diff</th>\n", | |
" <th>Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014_1107_1110</td>\n", | |
" <td>0.490968</td>\n", | |
" <td>0.589957</td>\n", | |
" <td>-0.098989</td>\n", | |
" <td>0.302656</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2014_1107_1112</td>\n", | |
" <td>0.490968</td>\n", | |
" <td>0.936298</td>\n", | |
" <td>-0.445330</td>\n", | |
" <td>0.022864</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2014_1110_1112</td>\n", | |
" <td>0.589957</td>\n", | |
" <td>0.936298</td>\n", | |
" <td>-0.346341</td>\n", | |
" <td>0.051156</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2014_1107_1113</td>\n", | |
" <td>0.490968</td>\n", | |
" <td>0.781062</td>\n", | |
" <td>-0.290094</td>\n", | |
" <td>0.079725</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2014_1110_1113</td>\n", | |
" <td>0.589957</td>\n", | |
" <td>0.781062</td>\n", | |
" <td>-0.191105</td>\n", | |
" <td>0.166392</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Score_Low Score_High Score_Diff Pred\n", | |
"0 2014_1107_1110 0.490968 0.589957 -0.098989 0.302656\n", | |
"1 2014_1107_1112 0.490968 0.936298 -0.445330 0.022864\n", | |
"2 2014_1110_1112 0.589957 0.936298 -0.346341 0.051156\n", | |
"3 2014_1107_1113 0.490968 0.781062 -0.290094 0.079725\n", | |
"4 2014_1110_1113 0.589957 0.781062 -0.191105 0.166392" | |
] | |
}, | |
"execution_count": 81, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Probabilities\n", | |
"diffs = df_sample['Score_Diff'].values.reshape(-1,1)\n", | |
"probs = clf3.predict_proba(diffs)\n", | |
"Y_prob = [item[1] for item in probs]\n", | |
"df_sample['Pred'] = Y_prob\n", | |
"df_sample.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 82, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2014_1107_1110</td>\n", | |
" <td>0.302656</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2014_1107_1112</td>\n", | |
" <td>0.022864</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2014_1110_1112</td>\n", | |
" <td>0.051156</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2014_1107_1113</td>\n", | |
" <td>0.079725</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2014_1110_1113</td>\n", | |
" <td>0.166392</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Pred\n", | |
"0 2014_1107_1110 0.302656\n", | |
"1 2014_1107_1112 0.022864\n", | |
"2 2014_1110_1112 0.051156\n", | |
"3 2014_1107_1113 0.079725\n", | |
"4 2014_1110_1113 0.166392" | |
] | |
}, | |
"execution_count": 82, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Submission\n", | |
"df_submission = df_sample[['ID', 'Pred']]\n", | |
"df_submission.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 83, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Write to csv\n", | |
"df_submission.to_csv('stage1_submission.csv', index=None)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Reformat Data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 84, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>W_Composite</th>\n", | |
" <th>L_Composite</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>0.319824</td>\n", | |
" <td>0.301622</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>0.963552</td>\n", | |
" <td>0.508381</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>0.825212</td>\n", | |
" <td>0.853798</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>0.756841</td>\n", | |
" <td>0.851419</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>0.840379</td>\n", | |
" <td>0.819419</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID W_Composite L_Composite\n", | |
"0 2003 134 1421 1411 0.319824 0.301622\n", | |
"1 2003 136 1112 1436 0.963552 0.508381\n", | |
"2 2003 136 1113 1272 0.825212 0.853798\n", | |
"3 2003 136 1141 1166 0.756841 0.851419\n", | |
"4 2003 136 1143 1301 0.840379 0.819419" | |
] | |
}, | |
"execution_count": 84, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"season_elos = season_elos.rename(columns={'team_id':'Team_ID', 'season':'Season', 'season_elo':'Elo'}) \n", | |
"df = pd.merge(left=season_elos, right=df_topranks, how='left', on=['Season', 'Team_ID'])\n", | |
"df.dropna(inplace=True)\n", | |
"scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", | |
"df['Elo_Scaled'] = scaler.fit_transform(df['Elo'].values.reshape(-1,1))\n", | |
"df['MeanRank_Scaled'] = 1 - scaler.fit_transform(df['MeanRank'].values.reshape(-1,1))\n", | |
"df['Composite Score'] = (df['Elo_Scaled'] + df['MeanRank_Scaled']) / 2\n", | |
"\n", | |
"data_dir = './March Madness 2018/DataFiles/'\n", | |
"df_tour = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')\n", | |
"df_tour.drop(labels=['WLoc', 'NumOT', 'WScore', 'LScore'], inplace=True, axis=1)\n", | |
"df.drop(labels=['Elo', 'season_elo', 'MeanRank'], inplace=True, axis=1)\n", | |
"\n", | |
"df_win_elos = df.rename(columns={'Team_ID':'WTeamID', 'Composite Score':'W_Composite'})\n", | |
"df_loss_elos = df.rename(columns={'Team_ID':'LTeamID', 'Composite Score':'L_Composite'}) \n", | |
"df_dummy = pd.merge(left=df_tour, right=df_win_elos, how='left', on=['Season', 'WTeamID'])\n", | |
"df_concat = pd.merge(left=df_dummy, right=df_loss_elos, on=['Season', 'LTeamID'])\n", | |
"\n", | |
"df_total = df_concat[['Season', 'DayNum', 'WTeamID', 'LTeamID', 'W_Composite', 'L_Composite']]\n", | |
"df_total.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 85, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>TeamID_Upper</th>\n", | |
" <th>TeamID_Lower</th>\n", | |
" <th>Composite_Upper</th>\n", | |
" <th>Composite_Lower</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>0.319824</td>\n", | |
" <td>0.301622</td>\n", | |
" <td>-0.018201</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1112</td>\n", | |
" <td>1436</td>\n", | |
" <td>1436</td>\n", | |
" <td>1112</td>\n", | |
" <td>0.963552</td>\n", | |
" <td>0.508381</td>\n", | |
" <td>-0.455171</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1113</td>\n", | |
" <td>1272</td>\n", | |
" <td>1272</td>\n", | |
" <td>1113</td>\n", | |
" <td>0.853798</td>\n", | |
" <td>0.825212</td>\n", | |
" <td>-0.028586</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1141</td>\n", | |
" <td>1166</td>\n", | |
" <td>1166</td>\n", | |
" <td>1141</td>\n", | |
" <td>0.851419</td>\n", | |
" <td>0.756841</td>\n", | |
" <td>-0.094578</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1143</td>\n", | |
" <td>1301</td>\n", | |
" <td>1301</td>\n", | |
" <td>1143</td>\n", | |
" <td>0.840379</td>\n", | |
" <td>0.819419</td>\n", | |
" <td>-0.020960</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID LTeamID TeamID_Upper TeamID_Lower \\\n", | |
"0 2003 134 1421 1411 1421 1411 \n", | |
"1 2003 136 1112 1436 1436 1112 \n", | |
"2 2003 136 1113 1272 1272 1113 \n", | |
"3 2003 136 1141 1166 1166 1141 \n", | |
"4 2003 136 1143 1301 1301 1143 \n", | |
"\n", | |
" Composite_Upper Composite_Lower Composite_Diff \n", | |
"0 0.319824 0.301622 -0.018201 \n", | |
"1 0.963552 0.508381 -0.455171 \n", | |
"2 0.853798 0.825212 -0.028586 \n", | |
"3 0.851419 0.756841 -0.094578 \n", | |
"4 0.840379 0.819419 -0.020960 " | |
] | |
}, | |
"execution_count": 85, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_total['TeamID_Upper'] = np.where(df_total['WTeamID'] >= df_total['LTeamID'], df_total['WTeamID'], df_total['LTeamID'])\n", | |
"df_total['TeamID_Lower'] = np.where(df_total['LTeamID'] >= df_total['WTeamID'], df_total['WTeamID'], df_total['LTeamID'])\n", | |
"\n", | |
"df_total['Composite_Upper'] = np.where(df_total['W_Composite'] >= df_total['L_Composite'], df_total['W_Composite'], df_total['L_Composite'])\n", | |
"df_total['Composite_Lower'] = np.where(df_total['L_Composite'] >= df_total['W_Composite'], df_total['W_Composite'], df_total['L_Composite'])\n", | |
"\n", | |
"df_total['Composite_Diff'] = df_total['Composite_Lower'] - df_total['Composite_Upper']\n", | |
"df_total = df_total[['Season', 'DayNum', 'WTeamID', 'LTeamID', 'TeamID_Upper', 'TeamID_Lower', 'Composite_Upper', 'Composite_Lower', 'Composite_Diff']]\n", | |
"\n", | |
"df_total.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 86, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>TeamID_Upper</th>\n", | |
" <th>TeamID_Lower</th>\n", | |
" <th>Composite_Upper</th>\n", | |
" <th>Composite_Lower</th>\n", | |
" <th>Composite_Diff</th>\n", | |
" <th>Result</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2003</td>\n", | |
" <td>134</td>\n", | |
" <td>1421</td>\n", | |
" <td>1411</td>\n", | |
" <td>0.319824</td>\n", | |
" <td>0.301622</td>\n", | |
" <td>-0.018201</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1436</td>\n", | |
" <td>1112</td>\n", | |
" <td>0.963552</td>\n", | |
" <td>0.508381</td>\n", | |
" <td>-0.455171</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1272</td>\n", | |
" <td>1113</td>\n", | |
" <td>0.853798</td>\n", | |
" <td>0.825212</td>\n", | |
" <td>-0.028586</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1166</td>\n", | |
" <td>1141</td>\n", | |
" <td>0.851419</td>\n", | |
" <td>0.756841</td>\n", | |
" <td>-0.094578</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2003</td>\n", | |
" <td>136</td>\n", | |
" <td>1301</td>\n", | |
" <td>1143</td>\n", | |
" <td>0.840379</td>\n", | |
" <td>0.819419</td>\n", | |
" <td>-0.020960</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum TeamID_Upper TeamID_Lower Composite_Upper \\\n", | |
"0 2003 134 1421 1411 0.319824 \n", | |
"1 2003 136 1436 1112 0.963552 \n", | |
"2 2003 136 1272 1113 0.853798 \n", | |
"3 2003 136 1166 1141 0.851419 \n", | |
"4 2003 136 1301 1143 0.840379 \n", | |
"\n", | |
" Composite_Lower Composite_Diff Result \n", | |
"0 0.301622 -0.018201 0 \n", | |
"1 0.508381 -0.455171 1 \n", | |
"2 0.825212 -0.028586 1 \n", | |
"3 0.756841 -0.094578 1 \n", | |
"4 0.819419 -0.020960 1 " | |
] | |
}, | |
"execution_count": 86, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_total['Result'] = np.where(df_total['WTeamID'] == df_total['TeamID_Lower'], 1, 0)\n", | |
"df_predictions = df_total.drop(['WTeamID', 'LTeamID'], axis=1)\n", | |
"df_predictions.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 87, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-0.69278746859047591" | |
] | |
}, | |
"execution_count": 87, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_train = df_predictions.loc[df_predictions['Season'] < 2014]\n", | |
"df_test = df_predictions.loc[df_predictions['Season'] >= 2014]\n", | |
"\n", | |
"X_train = df_train['Composite_Diff'].values.reshape(-1,1)\n", | |
"Y_train = df_train['Result'].values\n", | |
"\n", | |
"X_test = df_test['Composite_Diff'].values.reshape(-1,1)\n", | |
"Y_test = df_test['Result'].values\n", | |
"\n", | |
"logreg = LogisticRegression()\n", | |
"params = {'C': np.logspace(start=-5, stop=5, num=10)}\n", | |
"clf3 = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)\n", | |
"clf3.fit(X_train, Y_train)\n", | |
"clf3.score(X_train, Y_train)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Simplified Elements From Above" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_teams_df(year):\n", | |
" \n", | |
" # Get all teams for all seasons\n", | |
" stage2_dir = './March Madness 2018/Stage2UpdatedDataFiles/'\n", | |
" df = pd.read_csv(stage2_dir + 'RegularSeasonCompactResults.csv')\n", | |
"\n", | |
" df = df.loc[df['Season'] == year]\n", | |
" team_ids = set(df.WTeamID).union(set(df.LTeamID))\n", | |
" team_list = list(team_ids)\n", | |
" teams = pd.DataFrame({'Team_ID':team_list})\n", | |
" teams['Season'] = year\n", | |
" teams = teams[['Season', 'Team_ID']]\n", | |
" return(teams)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_team_name(id):\n", | |
" \n", | |
" # Get school name for a given team id in 2018\n", | |
" stage2_dir = './March Madness 2018/Stage2UpdatedDataFiles/'\n", | |
" teams = pd.read_csv(stage2_dir + 'teams.csv')\n", | |
" name = teams.loc[teams['TeamID'] == id]['TeamName']\n", | |
" return(name.values[0])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_team_id(name):\n", | |
" \n", | |
" # Get school name for a given team id in 2018\n", | |
" stage2_dir = './March Madness 2018/Stage2UpdatedDataFiles/'\n", | |
" teams = pd.read_csv(stage2_dir + 'teams.csv')\n", | |
" id = teams.loc[teams['TeamName'] == name]['TeamID']\n", | |
" return(id.values[0])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def elo_pred(elo1, elo2):\n", | |
" return(1. / (10. ** (-(elo1 - elo2) / 400.) + 1.))\n", | |
"\n", | |
"def expected_margin(elo_diff):\n", | |
" return((7.5 + 0.006 * elo_diff))\n", | |
"\n", | |
"def elo_update(w_elo, l_elo, margin, K):\n", | |
" elo_diff = w_elo - l_elo\n", | |
" pred = elo_pred(w_elo, l_elo)\n", | |
" mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)\n", | |
" update = K * mult * (1 - pred)\n", | |
" return(pred, update)\n", | |
"\n", | |
"def final_elo_per_season(df, team_id):\n", | |
" d = df.copy()\n", | |
" d = d.loc[(d.WTeamID == team_id) | (d.LTeamID == team_id), :]\n", | |
" d.sort_values(['Season', 'DayNum'], inplace=True)\n", | |
" d.drop_duplicates(['Season'], keep='last', inplace=True)\n", | |
" w_mask = d.WTeamID == team_id\n", | |
" l_mask = d.LTeamID == team_id\n", | |
" d['season_elo'] = None\n", | |
" d.loc[w_mask, 'season_elo'] = d.loc[w_mask, 'w_elo']\n", | |
" d.loc[l_mask, 'season_elo'] = d.loc[l_mask, 'l_elo']\n", | |
" out = pd.DataFrame({\n", | |
" 'team_id': team_id,\n", | |
" 'season': d.Season,\n", | |
" 'season_elo': d.season_elo\n", | |
" })\n", | |
" return(out)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_elos_df(year):\n", | |
" \n", | |
" # Data\n", | |
" stage2_dir = './March Madness 2018/Stage2UpdatedDataFiles/'\n", | |
" df = pd.read_csv(stage2_dir + 'RegularSeasonCompactResults.csv')\n", | |
" \n", | |
" # Constants\n", | |
" HOME_ADVANTAGE = 100 \n", | |
" K = 22\n", | |
" rs = df.loc[df['Season'] == year]\n", | |
" rs.reset_index(inplace = True)\n", | |
" \n", | |
" # Dictionary for lookups\n", | |
" team_ids = set(rs.WTeamID).union(set(rs.LTeamID))\n", | |
" elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))\n", | |
"\n", | |
" # Set up columns\n", | |
" rs['margin'] = rs.WScore - rs.LScore\n", | |
" rs['w_elo'] = None\n", | |
" rs['l_elo'] = None\n", | |
" \n", | |
" # Iterate through regular season\n", | |
" preds = []\n", | |
" for i in range(rs.shape[0]):\n", | |
"\n", | |
" # Get key data from current row\n", | |
" w = rs.at[i, 'WTeamID']\n", | |
" l = rs.at[i, 'LTeamID']\n", | |
" margin = rs.at[i, 'margin']\n", | |
" wloc = rs.at[i, 'WLoc']\n", | |
"\n", | |
" # Does either team get a home-court advantage?\n", | |
" w_ad, l_ad, = 0., 0.\n", | |
" if wloc == \"H\":\n", | |
" w_ad += HOME_ADVANTAGE\n", | |
" elif wloc == \"A\":\n", | |
" l_ad += HOME_ADVANTAGE\n", | |
"\n", | |
" # Get elo updates as a result of the game\n", | |
" pred, update = elo_update(elo_dict[w] + w_ad,\n", | |
" elo_dict[l] + l_ad, \n", | |
" margin, K)\n", | |
" elo_dict[w] += update\n", | |
" elo_dict[l] -= update\n", | |
" preds.append(pred)\n", | |
"\n", | |
" # Stores new elos in the games dataframe\n", | |
" rs.loc[i, 'w_elo'] = elo_dict[w]\n", | |
" rs.loc[i, 'l_elo'] = elo_dict[l]\n", | |
" \n", | |
" # Create and return final elo dataframe\n", | |
" df_list = [final_elo_per_season(rs, i) for i in team_ids]\n", | |
" season_elos = pd.concat(df_list)\n", | |
" season_elos.rename(columns={'season':'Season', 'team_id':'Team_ID', 'season_elo':'Elo'}, inplace = True)\n", | |
" return(season_elos)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_elo_score(elos_df, year, team_id):\n", | |
" \n", | |
" # Return final elo for a team in a given year\n", | |
" score = elos_df.loc[(elos_df['season'] == year) & (elos_df['team_id'] == team_id)]['season_elo']\n", | |
" return(score)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_select_ranks_df(year, day):\n", | |
"\n", | |
" # Get select ranking scores dataframe\n", | |
" data_dir = './March Madness 2018/'\n", | |
" df = pd.read_csv(data_dir + 'MasseyOrdinals_thruSeason2018_Day128.csv')\n", | |
"\n", | |
" # Get final day\n", | |
" data_dir = './March Madness 2018/'\n", | |
" df2 = pd.read_csv(data_dir + 'MasseyOrdinals_2018_133_only_53Systems.csv')\n", | |
" df = df.append(df2)\n", | |
" \n", | |
" # Set up\n", | |
" teams = get_teams_df(year)\n", | |
" df_massey = df.loc[df['Season'] == 2018]\n", | |
"\n", | |
" df_temp = df_massey.loc[(df_massey['RankingDayNum'] == day) & (df_massey['SystemName'] == 'SAG')]\n", | |
" df_temp = df_temp.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
" df_temp.rename(columns={'OrdinalRank':'SAG', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
" df_temp2 = df_massey.loc[(df_massey['RankingDayNum'] == day) & (df_massey['SystemName'] == 'WLK')]\n", | |
" df_temp2 = df_temp2.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
" df_temp2.rename(columns={'OrdinalRank':'WLK', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
" df_temp3 = df_massey.loc[(df_massey['RankingDayNum'] == day) & (df_massey['SystemName'] == 'POM')]\n", | |
" df_temp3 = df_temp3.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
" df_temp3.rename(columns={'OrdinalRank':'POM', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
" df_temp4 = df_massey.loc[(df_massey['RankingDayNum'] == day) & (df_massey['SystemName'] == 'MOR')]\n", | |
" df_temp4 = df_temp4.drop(labels=['RankingDayNum', 'SystemName'], axis=1)\n", | |
" df_temp4.rename(columns={'OrdinalRank':'MOR', 'TeamID':'Team_ID'}, inplace=True)\n", | |
"\n", | |
" teams = pd.merge(left=teams, right=df_temp, how='left', on=['Season', 'Team_ID'])\n", | |
" teams = pd.merge(left=teams, right=df_temp2, how='left', on=['Season', 'Team_ID'])\n", | |
" teams = pd.merge(left=teams, right=df_temp3, how='left', on=['Season', 'Team_ID'])\n", | |
" teams = pd.merge(left=teams, right=df_temp4, how='left', on=['Season', 'Team_ID'])\n", | |
" \n", | |
" # Calculate mean score\n", | |
" teams['MeanRank'] = (teams['SAG'] + teams['WLK'] + teams['POM'] + teams['MOR']) / 4\n", | |
" teams.dropna(inplace = True)\n", | |
" massey_df = teams\n", | |
" return(massey_df)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_select_rank(massey_df, year, day, team_id):\n", | |
" score = massey_df.loc[(massey_df['Season'] == year) & (massey_df['Team_ID'] == team_id)]['MeanRank']\n", | |
" return(score)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def get_composite_scores_df(year):\n", | |
" \n", | |
" # Get dataframe with composite scores for all teams\n", | |
" FINAL_DAY = 133\n", | |
" df = get_teams_df(year)\n", | |
" ranks = get_select_ranks_df(year, FINAL_DAY)\n", | |
" season_elos = get_elos_df(year)\n", | |
"\n", | |
" df = pd.merge(left=df, right=season_elos, how='left', on=['Season', 'Team_ID'])\n", | |
" df = pd.merge(left=df, right=ranks, how='left', on=['Season', 'Team_ID'])\n", | |
" df = df[['Season', 'Team_ID', 'MeanRank', 'Elo']]\n", | |
"\n", | |
" # Normalize features\n", | |
" scaler = preprocessing.MinMaxScaler(feature_range=(0,1))\n", | |
" df['Elo_Scaled'] = scaler.fit_transform(df['Elo'].values.reshape(-1,1))\n", | |
" df['MeanRank_Scaled'] = 1 - scaler.fit_transform(df['MeanRank'].values.reshape(-1,1))\n", | |
"\n", | |
" # Average rankings\n", | |
" df['Composite Score'] = (df['Elo_Scaled'] + (2 * df['MeanRank_Scaled'])) / 3\n", | |
" df = df[['Season', 'Team_ID', 'Composite Score']]\n", | |
" final_scores = df\n", | |
" return(final_scores)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def generate_probs(model, year):\n", | |
" data_dir = './March Madness 2018/'\n", | |
" sample = pd.read_csv(data_dir + 'SampleSubmissionStage2.csv')\n", | |
"\n", | |
" sample['Season'] = sample.apply(lambda row: row['ID'][0:4], axis=1)\n", | |
" sample['Team_ID_Low'] = sample.apply(lambda row: row['ID'][5:9], axis=1)\n", | |
" sample['Team_ID_High'] = sample.apply(lambda row: row['ID'][10:14], axis=1)\n", | |
" sample.head()\n", | |
"\n", | |
" df = get_composite_scores_df(year)\n", | |
" df = final_scores\n", | |
" df['Season'] = df['Season'].astype(str)\n", | |
" df['Team_ID'] = df['Team_ID'].astype(str)\n", | |
"\n", | |
" df_lows = df.rename(columns={'Composite Score':'Score', 'Team_ID':'Team_ID_Low'})\n", | |
" df_highs = df.rename(columns={'Composite Score':'Score', 'Team_ID':'Team_ID_High'})\n", | |
"\n", | |
" df_dummy = pd.merge(left=sample, right=df_lows, how='left', on=['Season', 'Team_ID_Low'])\n", | |
" df_concat = pd.merge(left=df_dummy, right=df_highs, on=['Season', 'Team_ID_High'])\n", | |
" df_sample = df_concat.rename(columns={'Score_x':'Score_Low', 'Score_y':'Score_High'})\n", | |
" df_sample['Score_Diff'] = df_sample['Score_Low'] - df_sample['Score_High']\n", | |
" df_full = df_sample\n", | |
" df_sample = df_sample[['ID', 'Score_Low', 'Score_High', 'Score_Diff','Pred']]\n", | |
"\n", | |
" diffs = df_sample['Score_Diff'].values.reshape(-1,1)\n", | |
" probs = model.predict_proba(diffs)\n", | |
" Y_prob = [item[1] for item in probs]\n", | |
" df_sample['Pred'] = Y_prob\n", | |
" df_sample = df_sample[['ID', 'Pred']]\n", | |
" return(df_sample)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 2018 Results EDA" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>DayNum</th>\n", | |
" <th>WTeamID</th>\n", | |
" <th>WScore</th>\n", | |
" <th>LTeamID</th>\n", | |
" <th>LScore</th>\n", | |
" <th>WLoc</th>\n", | |
" <th>NumOT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1985</td>\n", | |
" <td>20</td>\n", | |
" <td>1228</td>\n", | |
" <td>81</td>\n", | |
" <td>1328</td>\n", | |
" <td>64</td>\n", | |
" <td>N</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1106</td>\n", | |
" <td>77</td>\n", | |
" <td>1354</td>\n", | |
" <td>70</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1112</td>\n", | |
" <td>63</td>\n", | |
" <td>1223</td>\n", | |
" <td>56</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1165</td>\n", | |
" <td>70</td>\n", | |
" <td>1432</td>\n", | |
" <td>54</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1985</td>\n", | |
" <td>25</td>\n", | |
" <td>1192</td>\n", | |
" <td>86</td>\n", | |
" <td>1447</td>\n", | |
" <td>74</td>\n", | |
" <td>H</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT\n", | |
"0 1985 20 1228 81 1328 64 N 0\n", | |
"1 1985 25 1106 77 1354 70 H 0\n", | |
"2 1985 25 1112 63 1223 56 H 0\n", | |
"3 1985 25 1165 70 1432 54 H 0\n", | |
"4 1985 25 1192 86 1447 74 H 0" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Import data\n", | |
"stage2_dir = './March Madness 2018/Stage2UpdatedDataFiles/'\n", | |
"df = pd.read_csv(stage2_dir + 'RegularSeasonCompactResults.csv')\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>Team_ID</th>\n", | |
" <th>Composite Score</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2018</td>\n", | |
" <td>1101</td>\n", | |
" <td>0.295815</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2018</td>\n", | |
" <td>1102</td>\n", | |
" <td>0.328263</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2018</td>\n", | |
" <td>1103</td>\n", | |
" <td>0.292680</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2018</td>\n", | |
" <td>1104</td>\n", | |
" <td>0.780949</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2018</td>\n", | |
" <td>1105</td>\n", | |
" <td>0.000265</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season Team_ID Composite Score\n", | |
"0 2018 1101 0.295815\n", | |
"1 2018 1102 0.328263\n", | |
"2 2018 1103 0.292680\n", | |
"3 2018 1104 0.780949\n", | |
"4 2018 1105 0.000265" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Run composite score functions\n", | |
"final_scores = get_composite_scores_df(2018)\n", | |
"final_scores.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Pull team names and format data\n", | |
"final_teams = final_scores\n", | |
"pd.options.display.float_format = '{:.3f}'.format\n", | |
"\n", | |
"final_teams['Team Name'] = None\n", | |
"for index, rows in final_teams.iterrows():\n", | |
" final_teams['Team Name'][index] = get_team_name(final_teams['Team_ID'][index])\n", | |
"\n", | |
"final_teams = final_teams[['Season', 'Team Name', 'Composite Score']]\n", | |
"final_teams.sort_values(by='Composite Score', ascending = False, inplace = True)\n", | |
"\n", | |
"final_teams.reset_index(inplace = True, drop = True)\n", | |
"final_teams.index += 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Season</th>\n", | |
" <th>Team Name</th>\n", | |
" <th>Composite Score</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2018</td>\n", | |
" <td>Villanova</td>\n", | |
" <td>1.000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2018</td>\n", | |
" <td>Virginia</td>\n", | |
" <td>0.997</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2018</td>\n", | |
" <td>Cincinnati</td>\n", | |
" <td>0.976</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2018</td>\n", | |
" <td>Gonzaga</td>\n", | |
" <td>0.975</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>2018</td>\n", | |
" <td>Duke</td>\n", | |
" <td>0.962</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>2018</td>\n", | |
" <td>Purdue</td>\n", | |
" <td>0.960</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>2018</td>\n", | |
" <td>Michigan St</td>\n", | |
" <td>0.953</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>2018</td>\n", | |
" <td>Michigan</td>\n", | |
" <td>0.947</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>2018</td>\n", | |
" <td>North Carolina</td>\n", | |
" <td>0.930</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>2018</td>\n", | |
" <td>Kansas</td>\n", | |
" <td>0.926</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>2018</td>\n", | |
" <td>Xavier</td>\n", | |
" <td>0.924</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>2018</td>\n", | |
" <td>Houston</td>\n", | |
" <td>0.919</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>2018</td>\n", | |
" <td>Arizona</td>\n", | |
" <td>0.915</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>2018</td>\n", | |
" <td>Tennessee</td>\n", | |
" <td>0.913</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>2018</td>\n", | |
" <td>Wichita St</td>\n", | |
" <td>0.909</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>2018</td>\n", | |
" <td>Texas Tech</td>\n", | |
" <td>0.900</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>2018</td>\n", | |
" <td>West Virginia</td>\n", | |
" <td>0.897</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>18</th>\n", | |
" <td>2018</td>\n", | |
" <td>Kentucky</td>\n", | |
" <td>0.895</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>19</th>\n", | |
" <td>2018</td>\n", | |
" <td>Ohio St</td>\n", | |
" <td>0.888</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>20</th>\n", | |
" <td>2018</td>\n", | |
" <td>Auburn</td>\n", | |
" <td>0.883</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>21</th>\n", | |
" <td>2018</td>\n", | |
" <td>Nevada</td>\n", | |
" <td>0.881</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>2018</td>\n", | |
" <td>Clemson</td>\n", | |
" <td>0.871</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>23</th>\n", | |
" <td>2018</td>\n", | |
" <td>St Mary's CA</td>\n", | |
" <td>0.856</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>24</th>\n", | |
" <td>2018</td>\n", | |
" <td>TCU</td>\n", | |
" <td>0.856</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25</th>\n", | |
" <td>2018</td>\n", | |
" <td>Florida</td>\n", | |
" <td>0.854</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Season Team Name Composite Score\n", | |
"1 2018 Villanova 1.000\n", | |
"2 2018 Virginia 0.997\n", | |
"3 2018 Cincinnati 0.976\n", | |
"4 2018 Gonzaga 0.975\n", | |
"5 2018 Duke 0.962\n", | |
"6 2018 Purdue 0.960\n", | |
"7 2018 Michigan St 0.953\n", | |
"8 2018 Michigan 0.947\n", | |
"9 2018 North Carolina 0.930\n", | |
"10 2018 Kansas 0.926\n", | |
"11 2018 Xavier 0.924\n", | |
"12 2018 Houston 0.919\n", | |
"13 2018 Arizona 0.915\n", | |
"14 2018 Tennessee 0.913\n", | |
"15 2018 Wichita St 0.909\n", | |
"16 2018 Texas Tech 0.900\n", | |
"17 2018 West Virginia 0.897\n", | |
"18 2018 Kentucky 0.895\n", | |
"19 2018 Ohio St 0.888\n", | |
"20 2018 Auburn 0.883\n", | |
"21 2018 Nevada 0.881\n", | |
"22 2018 Clemson 0.871\n", | |
"23 2018 St Mary's CA 0.856\n", | |
"24 2018 TCU 0.856\n", | |
"25 2018 Florida 0.854" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Look at rankings\n", | |
"final_teams.head(25)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"### Submission I" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2018_1104_1112</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2018_1104_1113</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2018_1104_1116</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2018_1104_1120</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2018_1104_1137</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Pred\n", | |
"0 2018_1104_1112 0.500\n", | |
"1 2018_1104_1113 0.500\n", | |
"2 2018_1104_1116 0.500\n", | |
"3 2018_1104_1120 0.500\n", | |
"4 2018_1104_1137 0.500" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Sample data\n", | |
"data_dir = './March Madness 2018/'\n", | |
"sample = pd.read_csv(data_dir + 'SampleSubmissionStage2.csv')\n", | |
"sample.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2018_1104_1112</td>\n", | |
" <td>0.237</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2018_1104_1113</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2018_1112_1113</td>\n", | |
" <td>0.763</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2018_1104_1116</td>\n", | |
" <td>0.400</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2018_1112_1116</td>\n", | |
" <td>0.682</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Pred\n", | |
"0 2018_1104_1112 0.237\n", | |
"1 2018_1104_1113 0.500\n", | |
"2 2018_1112_1113 0.763\n", | |
"3 2018_1104_1116 0.400\n", | |
"4 2018_1112_1116 0.682" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Get probability dataframe\n", | |
"mod1 = pickle.load(open('ncaa_tourney1.pkl', 'rb'))\n", | |
"pred = generate_probs(mod1, 2018)\n", | |
"pred.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Write to csv\n", | |
"pred.to_csv('stage2_submission1.csv', index=None)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"### Submission II" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style>\n", | |
" .dataframe thead tr:only-child th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ID</th>\n", | |
" <th>Pred</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2018_1104_1112</td>\n", | |
" <td>0.244</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2018_1104_1113</td>\n", | |
" <td>0.500</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2018_1112_1113</td>\n", | |
" <td>0.755</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>2018_1104_1116</td>\n", | |
" <td>0.404</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2018_1112_1116</td>\n", | |
" <td>0.677</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ID Pred\n", | |
"0 2018_1104_1112 0.244\n", | |
"1 2018_1104_1113 0.500\n", | |
"2 2018_1112_1113 0.755\n", | |
"3 2018_1104_1116 0.404\n", | |
"4 2018_1112_1116 0.677" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Get probability dataframe\n", | |
"mod2 = pickle.load(open('ncaa_tourney2.pkl', 'rb'))\n", | |
"pred = generate_probs(mod2, 2018)\n", | |
"pred.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# Write to csv\n", | |
"pred.to_csv('stage2_submission2.csv', index=None)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |