{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# regular expressions\n",
"import re \n",
"\n",
"# math and data utilities\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as ss\n",
"import itertools\n",
"\n",
"# data and statistics libraries\n",
"import sklearn.preprocessing as pre\n",
"from sklearn import model_selection\n",
"from sklearn import metrics\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.svm import SVC\n",
"\n",
"# visualization libraries\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib as mpl\n",
"import seaborn as sns\n",
"\n",
"# Set-up default visualization parameters\n",
"mpl.rcParams['figure.figsize'] = [10,6]\n",
"viz_dict = {\n",
" 'axes.titlesize':18,\n",
" 'axes.labelsize':16,\n",
"}\n",
"sns.set_context(\"notebook\", rc=viz_dict)\n",
"sns.set_style(\"whitegrid\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Initial Setup\n",
"We can download the data from Kaggle to our data folder using the command line:\n",
"\n",
"`kaggle competitions download -c titanic`\n",
"\n",
"`unzip titanic.zip`\n",
"\n",
"After that, let's get the data into some Pandas dataframes:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.read_csv('data/train.csv', index_col='PassengerId')\n",
"test_df = pd.read_csv('data/test.csv', index_col='PassengerId')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploratory Data Analysis:\n",
"\n",
"Our next step will be to ask and answer the following questions:\n",
"\n",
"1. _Are we missing any data?_ \n",
"2. _What form does our data take?_\n",
"3. _What additional information can we garner from what we already have?_\n",
"4. _What relationships can we find between our variables, especially between the input and output variables?_ \n",
"5. _How can we use the answers to the first two question to add value to our data and the models that will use it?_\n",
"\n",
"\n",
"## Question 1: What Are We Missing?\n",
"Let's take a look at the number of entries in our training data, as well as those variables contain significant missing data. Below, we see that the training data contains 891 passenger samples, with 11 total variables describing each passenger. We see that there is a significant amount of missing data for the variables __Age__ and __Cabin__. We will have to deal with this missing data by either finding an intelligent way to fill the gaps, or perhaps dropping the features entirely. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 891 entries, 1 to 891\n",
"Data columns (total 11 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Survived 891 non-null int64 \n",
" 1 Pclass 891 non-null int64 \n",
" 2 Name 891 non-null object \n",
" 3 Sex 891 non-null object \n",
" 4 Age 714 non-null float64\n",
" 5 SibSp 891 non-null int64 \n",
" 6 Parch 891 non-null int64 \n",
" 7 Ticket 891 non-null object \n",
" 8 Fare 891 non-null float64\n",
" 9 Cabin 204 non-null object \n",
" 10 Embarked 889 non-null object \n",
"dtypes: float64(2), int64(4), object(5)\n",
"memory usage: 83.5+ KB\n"
]
}
],
"source": [
"# Question 1: Are we missing any data?\n",
"train_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 2: What is the form of our data?\n",
"Taking a look at our `.info()` print out as well as the first few entries of our data frame below, we see that our data comes primarily in the form of categorical data, with the exception of __Age__ and __Fare__. These categories are described by Python strings, which is why the data type above is listed as 'object'. This is how Pandas deals with unidentified data types. We will later tell Pandas that these variables are strings. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund, Mr. Owen Harris male 22.0 \n",
"2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n",
"3 Heikkinen, Miss. Laina female 26.0 \n",
"4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"5 Allen, Mr. William Henry male 35.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S \n",
"4 1 0 113803 53.1000 C123 S \n",
"5 0 0 373450 8.0500 NaN S "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Look at the first few entries\n",
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 3: What additional information can we garner from what we already have?\n",
"\n",
"### Passenger Title\n",
" \n",
"A quick glance at the __Name__ variable shows us that each name comes with a title. A title is useful in telling us things like social status, marriage status career, and even rank within a specific career. Therefore, it may be useful to have this information on hand. Let's parse it out:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Mr 517\n",
"Miss 182\n",
"Mrs 125\n",
"Master 40\n",
"Dr 7\n",
"Rev 6\n",
"Col 2\n",
"Mlle 2\n",
"Major 2\n",
"Don 1\n",
"Capt 1\n",
"Sir 1\n",
"Jonkheer 1\n",
"Countess 1\n",
"Mme 1\n",
"Ms 1\n",
"Lady 1\n",
"Name: Title, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df['Title'] = train_df['Name'].str.extract(r'([A-Za-z]+)\\.')\n",
"train_df.Title.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we might notice that many of these titles are synonymous. For example, _Mme_ is the French equivalent to 'Mrs' and _Mlle_ is the equivalent to 'Miss'. Other titles imply varying levels of nobility like 'Sir', 'Countess' and 'Don'. Some titles infer a profession. Let's reduce our titles to their common denominators:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"title_dict = {\n",
" 'Mrs': 'Mrs', 'Lady': 'Lady', 'Countess': 'Lady',\n",
" 'Jonkheer': 'Lord', 'Col': 'Officer', 'Rev': 'Rev',\n",
" 'Miss': 'Miss', 'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Miss', 'Dona': 'Lady',\n",
" 'Mr': 'Mr', 'Dr': 'Dr', 'Major': 'Officer', 'Capt': 'Officer', 'Sir': 'Lord', 'Don': 'Lord', 'Master': 'Master'\n",
"}\n",
"\n",
"train_df.Title = train_df.Title.map(title_dict)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Histogram of Categorical Data: Title')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(train_df.Title).set_title(\"Histogram of Categorical Data: Title\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.25</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Mr</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Name Sex Age SibSp \\\n",
"PassengerId \n",
"1 0 3 Braund, Mr. Owen Harris male 22.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Title \n",
"PassengerId \n",
"1 0 A/5 21171 7.25 NaN S Mr "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By again looking at our data set, we might notice the variables __SibSp__ and __Parch__. The first is the number of siblings and/or spouses that a passenger traveled with. The second is the number of parents and/or children a passenger traveled with. Combining these two variables we can get total family size. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"train_df['FamilySize'] = 1 + train_df.SibSp + train_df.Parch"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x22a8c07c5b0>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.countplot(train_df.FamilySize)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It appears that the Titanic's voyage was not necessarily a couple's or family affair. The majority of passengers traveled alone, and perhaps that is valuable information. Let's add the category __Alone__."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x22a8be929d0>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 576x360 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"train_df['Alone'] = train_df.FamilySize.apply(lambda x: 1 if x==1 else 0)\n",
"plt.figure(figsize=(8,5))\n",
"sns.countplot(train_df.Alone)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Last Name\n",
"\n",
"A last name is a group identity. While we know that many passengers traveled alone, there were still a significant number of families onboard the Titanic. Perhaps survival among specific families was more common than others. This is all speculation, but perhaps worth a look."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"train_df['LName'] = train_df.Name.str.extract(r'([A-Za-z]+),')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Name Length\n",
"\n",
"This one has a very simple explanation: While reviewing notebooks on Kaggle, I saw that one competitor found that the length of a person's name added to the performance of the model. So, why not try it out?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" <th>FamilySize</th>\n",
" <th>Alone</th>\n",
" <th>LName</th>\n",
" <th>NameLength</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Mr</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Braund</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>Mrs</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Cumings</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Miss</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Heikkinen</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>Mrs</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Futrelle</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Mr</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Allen</td>\n",
" <td>24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Rev</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Montvila</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" <td>Miss</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Miss</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>Johnston</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" <td>Mr</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>891</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" <td>Mr</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Dooley</td>\n",
" <td>19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 16 columns</p>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"... ... ... \n",
"887 0 2 \n",
"888 1 1 \n",
"889 0 3 \n",
"890 1 1 \n",
"891 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund, Mr. Owen Harris male 22.0 \n",
"2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n",
"3 Heikkinen, Miss. Laina female 26.0 \n",
"4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"5 Allen, Mr. William Henry male 35.0 \n",
"... ... ... ... \n",
"887 Montvila, Rev. Juozas male 27.0 \n",
"888 Graham, Miss. Margaret Edith female 19.0 \n",
"889 Johnston, Miss. Catherine Helen \"Carrie\" female NaN \n",
"890 Behr, Mr. Karl Howell male 26.0 \n",
"891 Dooley, Mr. Patrick male 32.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked Title \\\n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S Mr \n",
"2 1 0 PC 17599 71.2833 C85 C Mrs \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S Miss \n",
"4 1 0 113803 53.1000 C123 S Mrs \n",
"5 0 0 373450 8.0500 NaN S Mr \n",
"... ... ... ... ... ... ... ... \n",
"887 0 0 211536 13.0000 NaN S Rev \n",
"888 0 0 112053 30.0000 B42 S Miss \n",
"889 1 2 W./C. 6607 23.4500 NaN S Miss \n",
"890 0 0 111369 30.0000 C148 C Mr \n",
"891 0 0 370376 7.7500 NaN Q Mr \n",
"\n",
" FamilySize Alone LName NameLength \n",
"PassengerId \n",
"1 2 0 Braund 23 \n",
"2 2 0 Cumings 51 \n",
"3 1 1 Heikkinen 22 \n",
"4 2 0 Futrelle 44 \n",
"5 1 1 Allen 24 \n",
"... ... ... ... ... \n",
"887 1 1 Montvila 21 \n",
"888 1 1 Graham 28 \n",
"889 4 0 Johnston 40 \n",
"890 1 1 Behr 21 \n",
"891 1 1 Dooley 19 \n",
"\n",
"[891 rows x 16 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df['NameLength'] = train_df.Name.apply(len)\n",
"train_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 4: What statistical relationships does our data contain?\n",
"\n",
"We now have a more robust data set that includes (possibly) valuable new insights into our passengers lives. But how helpful is this data, really? One way to find out is to look at the statistical relationships between our variables, especially between each input variable and our single output variable __Survived__. \n",
"\n",
"Correlation is a common go-to tool we would use to determine such relationships. However, it is important to note that we have mostly categorical data in our data set, and that throws a small wrench in our gears. \n",
"\n",
"First, our categorical data needs to be encoded into numeric format before we can do calculations of any kind. \n",
"\n",
"Next, we need to consider the types categorical we are studying:\n",
"\n",
"- __Ordinal__ variables imply an underlying rank, or order. The classifications mild, moderate, severe would be an example. A common method of calculating the correlation between ordinal variables is called _Kendall's Tau ($\\tau$)_. \n",
"- __Nominal__ variables have no such rank or order. Examples might be male or female, cat or dog. In this case we will use _Cramer's V_ to determine association."
]
},
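{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough reminder of what Kendall's Tau measures, here is the basic textbook form (a sketch in my own notation, not necessarily the exact variant a given library computes). For $n$ observations, let $C$ be the number of concordant pairs (both variables move in the same direction) and $D$ the number of discordant pairs (they move in opposite directions):\n",
"\n",
"$$\\tau = \\frac{C - D}{n(n-1)/2}$$\n",
"\n",
"Values near $+1$ or $-1$ indicate a strong monotonic association, and values near $0$ indicate little or none. Our discrete variables produce many tied pairs, and in that situation a tie-adjusted variant such as $\\tau_b$ is normally used instead of this plain form."
]
},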
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" <th>FamilySize</th>\n",
" <th>Alone</th>\n",
" <th>LName</th>\n",
" <th>NameLength</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.25</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>Mr</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Braund</td>\n",
" <td>23</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Name Sex Age SibSp \\\n",
"PassengerId \n",
"1 0 3 Braund, Mr. Owen Harris male 22.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Title FamilySize Alone \\\n",
"PassengerId \n",
"1 0 A/5 21171 7.25 NaN S Mr 2 0 \n",
"\n",
" LName NameLength \n",
"PassengerId \n",
"1 Braund 23 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# nominal variables (use Cramer's V)\n",
"nom_vars = ['Survived', 'Title', 'Embarked', 'Sex', 'Alone', 'LName']\n",
"\n",
"# ordinal variables (nominal-ordinal, use Rank Biserial or Kendall's Tau)\n",
"ord_vars = ['Survived', 'Pclass', 'FamilySize', 'Parch', 'SibSp', 'NameLength']\n",
"\n",
"# continuous variables (use Pearson's r)\n",
"cont_vars = ['Survived', 'Fare', 'Age']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell above, we separate our variables by their data types. The reason for this is that when considering the underlying associations between variables, there is not a \"one size fits all\" method. The most common mathematical method of calculating correlation is _Pearson's r_, which should typically only be used on continuous variables. In our case, the vast majority of variables are actually discrete/categorical. \n",
"\n",
"In order to perform calculations, we must convert any non numeric data into numbers. You can't do math on words, so let's get started:"
]
},
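{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, Pearson's r for $n$ paired samples $(x_i, y_i)$ with sample means $\\bar{x}$ and $\\bar{y}$ is usually written as follows (standard textbook notation, added here only as a reminder):\n",
"\n",
"$$r = \\frac{\\sum_{i=1}^{n}(x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^{n}(x_i - \\bar{x})^2}\\,\\sqrt{\\sum_{i=1}^{n}(y_i - \\bar{y})^2}}$$\n",
"\n",
"Because the formula depends on the numeric values themselves, it only makes sense when those values carry real magnitude. That is not the case for category codes, whose numbers are arbitrary labels."
]
},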
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# convert all string 'object' types to numeric categories\n",
"for i in train_df.columns:\n",
" if train_df[i].dtype == 'object':\n",
" train_df[i], _ = pd.factorize(train_df[i])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# A method that creates a correlation matrix in the form of a Pandas DataFrame using Cramer's V.\n",
"def cramers_v_matrix(dataframe, variables):\n",
" \n",
" df = pd.DataFrame(index=dataframe[variables].columns, columns=dataframe[variables].columns, dtype=\"float64\")\n",
" \n",
" for v1, v2 in itertools.combinations(variables, 2):\n",
" \n",
" # generate contingency table:\n",
" table = pd.crosstab(dataframe[v1], dataframe[v2])\n",
" n = len(dataframe.index)\n",
" r, k = table.shape\n",
" \n",
" # calculate chi squared and phi\n",
" chi2 = ss.chi2_contingency(table)[0]\n",
" phi2 = chi2/n\n",
" \n",
" # bias corrections:\n",
" r = r - ((r - 1)**2)/(n - 1)\n",
" k = k - ((k - 1)**2)/(n - 1)\n",
" phi2 = max(0, phi2 - (k - 1)*(r - 1)/(n - 1))\n",
" \n",
" # fill correlation matrix\n",
" df.loc[v1, v2] = np.sqrt(phi2/min(k - 1, r - 1))\n",
" df.loc[v2, v1] = np.sqrt(phi2/min(k - 1, r - 1))\n",
" np.fill_diagonal(df.values, np.ones(len(df)))\n",
" \n",
" return df"
]
},
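{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readers who want the formula behind the function above: it is based on the bias-corrected form of Cramer's V (usually attributed to Bergsma). What follows is a sketch of that formula in my own notation, where $n$ is the number of samples and the contingency table has $r$ rows and $k$ columns:\n",
"\n",
"$$\\varphi^2 = \\frac{\\chi^2}{n}, \\qquad \\tilde{\\varphi}^2 = \\max\\left(0,\\ \\varphi^2 - \\frac{(k-1)(r-1)}{n-1}\\right)$$\n",
"\n",
"$$\\tilde{r} = r - \\frac{(r-1)^2}{n-1}, \\qquad \\tilde{k} = k - \\frac{(k-1)^2}{n-1}$$\n",
"\n",
"$$\\tilde{V} = \\sqrt{\\frac{\\tilde{\\varphi}^2}{\\min(\\tilde{k}-1,\\ \\tilde{r}-1)}}$$\n",
"\n",
"The result lies between 0 (no association) and 1 (perfect association). Unlike Kendall's Tau or Pearson's r it has no sign, which is why the first heatmap below uses `vmin=0`."
]
},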
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, \"Pearson's R Correlation\")"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x432 with 6 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig, axes = plt.subplots(1, 3, figsize=(20,6))\n",
"\n",
"# nominal variable correlation\n",
"ax1 = sns.heatmap(cramers_v_matrix(train_df, nom_vars), annot=True, ax=axes[0], vmin=0)\n",
"\n",
"# ordinal variable correlation: \n",
"ax2 = sns.heatmap(train_df[ord_vars].corr(method='kendall'), annot=True, ax=axes[1], vmin=-1)\n",
"\n",
"# Pearson's correlation:\n",
"ax3 = sns.heatmap(train_df[cont_vars].corr(), annot=True, ax=axes[2], vmin=-1)\n",
"\n",
"ax1.set_title(\"Cramer's V Correlation\")\n",
"ax2.set_title(\"Kendall's Tau Correlation\")\n",
"ax3.set_title(\"Pearson's R Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above heatmaps show our strength of association between each variable. While there is no rigid standard for \"Highly Associated\" or \"Weakly Associated\", we will use a cut-off value of |0.1| between our independent variables and survival. We will likely drop features whose association is lower than |0.1|. This is an entirely arbitrary guess, and I may return to raise or lower the bar later (In fact, I decided to keep Age after noticing improved performance when I did). \n",
"\n",
"For now, the feature that meets the criteria for dropping are __SibSp__. Additionally, I am choosing to drop __Name__, __Ticket__ and __Cabin__, mostly on a hunch that they don't add much. \n",
"\n",
"It should be noted that correlation between independent (predictor) variables can mean redundant information. This can cause a drop in performance in some algorithms. Some might choose to drop highly correlated predictors. I did not take the time to do that, but you might try it at home!"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" <th>FamilySize</th>\n",
" <th>Alone</th>\n",
" <th>LName</th>\n",
" <th>NameLength</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>22.0</td>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>38.0</td>\n",
" <td>0</td>\n",
" <td>71.2833</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>7.9250</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>53.1000</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>13.0000</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>663</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>30.0000</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>233</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>23.4500</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>603</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>30.0000</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>664</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>891</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>7.7500</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>665</td>\n",
" <td>19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Sex Age Parch Fare Embarked Title \\\n",
"PassengerId \n",
"1 0 3 0 22.0 0 7.2500 0 0 \n",
"2 1 1 1 38.0 0 71.2833 1 1 \n",
"3 1 3 1 26.0 0 7.9250 0 2 \n",
"4 1 1 1 35.0 0 53.1000 0 1 \n",
"5 0 3 0 35.0 0 8.0500 0 0 \n",
"... ... ... ... ... ... ... ... ... \n",
"887 0 2 0 27.0 0 13.0000 0 5 \n",
"888 1 1 1 19.0 0 30.0000 0 2 \n",
"889 0 3 1 NaN 2 23.4500 0 2 \n",
"890 1 1 0 26.0 0 30.0000 1 0 \n",
"891 0 3 0 32.0 0 7.7500 2 0 \n",
"\n",
" FamilySize Alone LName NameLength \n",
"PassengerId \n",
"1 2 0 0 23 \n",
"2 2 0 1 51 \n",
"3 1 1 2 22 \n",
"4 2 0 3 44 \n",
"5 1 1 4 24 \n",
"... ... ... ... ... \n",
"887 1 1 663 21 \n",
"888 1 1 233 28 \n",
"889 4 0 603 40 \n",
"890 1 1 664 21 \n",
"891 1 1 665 19 \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"todrop = ['SibSp', 'Ticket', 'Cabin', 'Name']\n",
"train_df = train_df.drop(todrop, axis=1)\n",
"train_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup for Machine Learning:\n",
"\n",
"During this phase, we will begin to format our data for feeding into a machine learning algorithm. We will then use this formatted data to get a picture of what a few different models can do for us, and pick the best one. This phase is broken into the following parts:\n",
"\n",
"1. __Train/Test Split__\n",
"2. __Normalize Data of each split__ \n",
"3. __Impute missing values__ \n",
"\n",
"Let's go.\n",
"\n",
"## Train/Test Split\n",
"\n",
"We will split our data once into training and testing sets. Within the training set, we will use stratified k-fold cross validation to find average performance of our models. \n",
"\n",
"The test set will not be touched until after we have fully tuned each of our candidate models using the training data and k-fold cross validation. Once training and tuning is complete, we will compare the results of each model on the held-out test set. The one that performs the best will be used for the competition."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"X = train_df.drop(['Survived'], axis = 1)\n",
"Y = train_df.loc[:, 'Survived']\n",
"\n",
"x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=333)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalizing the Data\n",
"\n",
"Some Machine Learning models require all of our predictors to be on the same scale, while others do not. Most notably, models like Logistic Regression and SVM will probably benefit from scaling, while decision trees will simply ignore scaling. Because we are going to be looking at a mixed bag of algorithms, I'm going to go ahead and scale our data."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" <th>FamilySize</th>\n",
" <th>Alone</th>\n",
" <th>LName</th>\n",
" <th>NameLength</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>466</th>\n",
" <td>0.810433</td>\n",
" <td>-0.751555</td>\n",
" <td>0.606253</td>\n",
" <td>-0.454384</td>\n",
" <td>-0.503686</td>\n",
" <td>-0.557754</td>\n",
" <td>-0.677693</td>\n",
" <td>-0.563259</td>\n",
" <td>0.802709</td>\n",
" <td>0.474261</td>\n",
" <td>0.412271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>508</th>\n",
" <td>-1.624285</td>\n",
" <td>-0.751555</td>\n",
" <td>NaN</td>\n",
" <td>-0.454384</td>\n",
" <td>-0.074801</td>\n",
" <td>-0.557754</td>\n",
" <td>-0.677693</td>\n",
" <td>-0.563259</td>\n",
" <td>0.802709</td>\n",
" <td>0.630255</td>\n",
" <td>1.883636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>316</th>\n",
" <td>0.810433</td>\n",
" <td>1.330574</td>\n",
" <td>-0.234031</td>\n",
" <td>-0.454384</td>\n",
" <td>-0.485998</td>\n",
" <td>-0.557754</td>\n",
" <td>0.974764</td>\n",
" <td>-0.563259</td>\n",
" <td>0.802709</td>\n",
" <td>-0.102915</td>\n",
" <td>0.412271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>341</th>\n",
" <td>-0.406926</td>\n",
" <td>-0.751555</td>\n",
" <td>-1.914600</td>\n",
" <td>0.771076</td>\n",
" <td>-0.086898</td>\n",
" <td>-0.557754</td>\n",
" <td>1.800993</td>\n",
" <td>0.749475</td>\n",
" <td>-1.245781</td>\n",
" <td>-0.830884</td>\n",
" <td>0.307174</td>\n",
" </tr>\n",
" <tr>\n",
" <th>118</th>\n",
" <td>-0.406926</td>\n",
" <td>-0.751555</td>\n",
" <td>-0.023960</td>\n",
" <td>-0.454384</td>\n",
" <td>-0.196868</td>\n",
" <td>-0.557754</td>\n",
" <td>-0.677693</td>\n",
" <td>0.093108</td>\n",
" <td>-1.245781</td>\n",
" <td>-1.324864</td>\n",
" <td>0.412271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>0.810433</td>\n",
" <td>-0.751555</td>\n",
" <td>NaN</td>\n",
" <td>-0.454384</td>\n",
" <td>-0.317836</td>\n",
" <td>2.631970</td>\n",
" <td>-0.677693</td>\n",
" <td>0.093108</td>\n",
" <td>-1.245781</td>\n",
" <td>-1.298865</td>\n",
" <td>-1.059093</td>\n",
" </tr>\n",
" <tr>\n",
" <th>375</th>\n",
" <td>0.810433</td>\n",
" <td>1.330574</td>\n",
" <td>-1.844576</td>\n",
" <td>0.771076</td>\n",
" <td>-0.195219</td>\n",
" <td>-0.557754</td>\n",
" <td>0.974764</td>\n",
" <td>2.062208</td>\n",
" <td>-1.245781</td>\n",
" <td>-1.491257</td>\n",
" <td>-0.113216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>367</th>\n",
" <td>-1.624285</td>\n",
" <td>1.330574</td>\n",
" <td>2.146774</td>\n",
" <td>-0.454384</td>\n",
" <td>0.996311</td>\n",
" <td>1.037108</td>\n",
" <td>0.148535</td>\n",
" <td>0.093108</td>\n",
" <td>-1.245781</td>\n",
" <td>0.084277</td>\n",
" <td>2.198929</td>\n",
" </tr>\n",
" <tr>\n",
" <th>420</th>\n",
" <td>0.810433</td>\n",
" <td>1.330574</td>\n",
" <td>-1.354410</td>\n",
" <td>1.996537</td>\n",
" <td>-0.127587</td>\n",
" <td>-0.557754</td>\n",
" <td>0.974764</td>\n",
" <td>0.749475</td>\n",
" <td>-1.245781</td>\n",
" <td>0.276669</td>\n",
" <td>-0.218313</td>\n",
" </tr>\n",
" <tr>\n",
" <th>781</th>\n",
" <td>0.810433</td>\n",
" <td>1.330574</td>\n",
" <td>-1.144339</td>\n",
" <td>-0.454384</td>\n",
" <td>-0.499744</td>\n",
" <td>1.037108</td>\n",
" <td>0.974764</td>\n",
" <td>-0.563259</td>\n",
" <td>0.802709</td>\n",
" <td>1.597414</td>\n",
" <td>-0.743801</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>712 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" Pclass Sex Age Parch Fare Embarked \\\n",
"PassengerId \n",
"466 0.810433 -0.751555 0.606253 -0.454384 -0.503686 -0.557754 \n",
"508 -1.624285 -0.751555 NaN -0.454384 -0.074801 -0.557754 \n",
"316 0.810433 1.330574 -0.234031 -0.454384 -0.485998 -0.557754 \n",
"341 -0.406926 -0.751555 -1.914600 0.771076 -0.086898 -0.557754 \n",
"118 -0.406926 -0.751555 -0.023960 -0.454384 -0.196868 -0.557754 \n",
"... ... ... ... ... ... ... \n",
"47 0.810433 -0.751555 NaN -0.454384 -0.317836 2.631970 \n",
"375 0.810433 1.330574 -1.844576 0.771076 -0.195219 -0.557754 \n",
"367 -1.624285 1.330574 2.146774 -0.454384 0.996311 1.037108 \n",
"420 0.810433 1.330574 -1.354410 1.996537 -0.127587 -0.557754 \n",
"781 0.810433 1.330574 -1.144339 -0.454384 -0.499744 1.037108 \n",
"\n",
" Title FamilySize Alone LName NameLength \n",
"PassengerId \n",
"466 -0.677693 -0.563259 0.802709 0.474261 0.412271 \n",
"508 -0.677693 -0.563259 0.802709 0.630255 1.883636 \n",
"316 0.974764 -0.563259 0.802709 -0.102915 0.412271 \n",
"341 1.800993 0.749475 -1.245781 -0.830884 0.307174 \n",
"118 -0.677693 0.093108 -1.245781 -1.324864 0.412271 \n",
"... ... ... ... ... ... \n",
"47 -0.677693 0.093108 -1.245781 -1.298865 -1.059093 \n",
"375 0.974764 2.062208 -1.245781 -1.491257 -0.113216 \n",
"367 0.148535 0.093108 -1.245781 0.084277 2.198929 \n",
"420 0.974764 0.749475 -1.245781 0.276669 -0.218313 \n",
"781 0.974764 -0.563259 0.802709 1.597414 -0.743801 \n",
"\n",
"[712 rows x 11 columns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We normalize the training and testing data separately so as to avoid data leaks.\n",
"\n",
"x_train = pd.DataFrame(pre.scale(x_train), columns=x_train.columns, index=x_train.index)\n",
"x_test = pd.DataFrame(pre.scale(x_test), columns=x_test.columns, index=x_test.index)\n",
"x_train"
]
},
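{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, a common alternative to scaling each split on its own is to learn the mean and standard deviation from the training split only and reuse them on the test split. The sketch below illustrates this with scikit-learn's `StandardScaler` on small made-up arrays; `toy_train` and `toy_test` are hypothetical stand-ins, not the notebook's data, and this is not the approach used in the cell above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Toy stand-ins for x_train / x_test (values are made up for illustration).\n",
"toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])\n",
"toy_test = np.array([[2.0, 40.0]])\n",
"\n",
"scaler = StandardScaler()\n",
"toy_train_scaled = scaler.fit_transform(toy_train)  # learn mean/std from the training rows only\n",
"toy_test_scaled = scaler.transform(toy_test)        # reuse the training statistics on the test rows\n",
"\n",
"print(toy_train_scaled.mean(axis=0))  # close to zero in every column\n",
"print(toy_test_scaled)\n",
"```\n",
"\n",
"Because the test row is scaled with the training statistics, its values stay on the same scale the model saw during fitting."
]
},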
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imputing Missing Data\n",
"\n",
"You might recall that there were a significant amount of missing __Age__ values in our data. Let's fill this in with the median age:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 712 entries, 466 to 781\n",
"Data columns (total 11 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Pclass 712 non-null float64\n",
" 1 Sex 712 non-null float64\n",
" 2 Age 712 non-null float64\n",
" 3 Parch 712 non-null float64\n",
" 4 Fare 712 non-null float64\n",
" 5 Embarked 712 non-null float64\n",
" 6 Title 712 non-null float64\n",
" 7 FamilySize 712 non-null float64\n",
" 8 Alone 712 non-null float64\n",
" 9 LName 712 non-null float64\n",
" 10 NameLength 712 non-null float64\n",
"dtypes: float64(11)\n",
"memory usage: 66.8 KB\n"
]
}
],
"source": [
"x_train.loc[x_train.Age.isnull(), 'Age'] = x_train.loc[:, 'Age'].median()\n",
"x_test.loc[x_test.Age.isnull(), 'Age'] = x_test.loc[:, 'Age'].median()\n",
"x_train.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we see that each and every variable that we chose to keep has 712 valid data entries. \n",
"\n",
"# Model Selection\n",
"\n",
"Now that we have prepared our data, we want to look at different options available to us for solving classification problems. Some common ones are:\n",
"\n",
"- K-Nearest Neighbors\n",
"- Support Vector Machines\n",
"- Decision Trees\n",
"- Logistic Regression\n",
"\n",
"We will train and tune each of these models on our training data by way of k-fold cross-validation. When complete, we will compare the tuned models' performance on a held out test set. \n",
"\n",
"## Training and Comparing Base Models:\n",
"\n",
"First, we want to get a feel model's performance before tuning. We will write two functions to help us describe our results. The first will evaluate the model several times over random splits in the data, and return the average performance as a dictionary. The second will simply nicely print our dictionary."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def kfold_evaluate(model, folds=5):\n",
" eval_dict = {}\n",
" accuracy = 0\n",
" f1 = 0\n",
" AUC = 0\n",
" \n",
" skf = model_selection.StratifiedKFold(n_splits=folds)\n",
" \n",
" # perform k splits on the training data. Gather performance results.\n",
" for train_idx, test_idx in skf.split(x_train, y_train):\n",
" xk_train, xk_test = x_train.iloc[train_idx], x_train.iloc[test_idx]\n",
" yk_train, yk_test = y_train.iloc[train_idx], y_train.iloc[test_idx]\n",
" \n",
" model.fit(xk_train, yk_train)\n",
" y_pred = model.predict(xk_test)\n",
" report = metrics.classification_report(yk_test, y_pred, output_dict=True)\n",
" \n",
" prob_array = model.predict_proba(xk_test)\n",
" \n",
" fpr, tpr, huh = metrics.roc_curve(yk_test, model.predict_proba(xk_test)[:,1])\n",
" auc = metrics.auc(fpr, tpr)\n",
" accuracy += report['accuracy']\n",
" f1 += report['macro avg']['f1-score']\n",
" AUC += auc\n",
" \n",
" # Average performance metrics over the k folds\n",
" measures = np.array([accuracy, f1, AUC])\n",
" measures = measures/folds\n",
"\n",
" # Add metric averages to dictionary and return.\n",
" eval_dict['Accuracy'] = measures[0]\n",
" eval_dict['F1 Score'] = measures[1]\n",
" eval_dict['AUC'] = measures[2] \n",
" eval_dict['Model'] = model\n",
" \n",
" return eval_dict\n",
"\n",
"# a function to pretty print our dictionary of dictionaries:\n",
"def pprint(web, level):\n",
" for k,v in web.items():\n",
" if isinstance(v, dict):\n",
" print('\\t'*level, f'{k}: ')\n",
" level += 1\n",
" pprint(v, level)\n",
" level -= 1\n",
" else:\n",
" print('\\t'*level, k, \": \", v)"
]
},
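{
"cell_type": "markdown",
"metadata": {},
"source": [
"scikit-learn can also compute these fold averages directly. As a sketch under stated assumptions (synthetic data from `make_classification` standing in for `x_train` and `y_train`, and a `LogisticRegression` chosen only for illustration), `cross_validate` with several scorers returns the same kinds of per-fold metrics that `kfold_evaluate` accumulates by hand:\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import StratifiedKFold, cross_validate\n",
"\n",
"# Synthetic stand-in for x_train / y_train (this is not the Titanic data).\n",
"X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)\n",
"\n",
"metric_names = ['accuracy', 'f1_macro', 'roc_auc']\n",
"cv_scores = cross_validate(\n",
"    LogisticRegression(max_iter=1000),\n",
"    X_toy,\n",
"    y_toy,\n",
"    cv=StratifiedKFold(n_splits=5),\n",
"    scoring=metric_names,\n",
")\n",
"\n",
"# Each 'test_<metric>' entry holds one score per fold; average them.\n",
"summary = {name: cv_scores['test_' + name].mean() for name in metric_names}\n",
"print(summary)\n",
"```\n",
"\n",
"The hand-written loop is kept in this notebook because it also returns the fitted model alongside the scores."
]
},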
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"evals = {}\n",
"evals['KNN'] = kfold_evaluate(KNeighborsClassifier())\n",
"evals['Logistic Regression'] = kfold_evaluate(LogisticRegression(max_iter=1000))\n",
"evals['Random Forest'] = kfold_evaluate(RandomForestClassifier())\n",
"evals['SVC'] = kfold_evaluate(SVC(probability=True))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>KNN</th>\n",
" <th>Logistic Regression</th>\n",
" <th>Random Forest</th>\n",
" <th>SVC</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Accuracy</th>\n",
" <td>0.8216</td>\n",
" <td>0.817394</td>\n",
" <td>0.8399</td>\n",
" <td>0.835625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>F1 Score</th>\n",
" <td>0.808083</td>\n",
" <td>0.805034</td>\n",
" <td>0.828417</td>\n",
" <td>0.823467</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AUC</th>\n",
" <td>0.842471</td>\n",
" <td>0.865725</td>\n",
" <td>0.870281</td>\n",
" <td>0.848343</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Model</th>\n",
" <td>KNeighborsClassifier(algorithm='auto', leaf_si...</td>\n",
" <td>LogisticRegression(C=1.0, class_weight=None, d...</td>\n",
" <td>(DecisionTreeClassifier(ccp_alpha=0.0, class_w...</td>\n",
" <td>SVC(C=1.0, break_ties=False, cache_size=200, c...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" KNN \\\n",
"Accuracy 0.8216 \n",
"F1 Score 0.808083 \n",
"AUC 0.842471 \n",
"Model KNeighborsClassifier(algorithm='auto', leaf_si... \n",
"\n",
" Logistic Regression \\\n",
"Accuracy 0.817394 \n",
"F1 Score 0.805034 \n",
"AUC 0.865725 \n",
"Model LogisticRegression(C=1.0, class_weight=None, d... \n",
"\n",
" Random Forest \\\n",
"Accuracy 0.8399 \n",
"F1 Score 0.828417 \n",
"AUC 0.870281 \n",
"Model (DecisionTreeClassifier(ccp_alpha=0.0, class_w... \n",
"\n",
" SVC \n",
"Accuracy 0.835625 \n",
"F1 Score 0.823467 \n",
"AUC 0.848343 \n",
"Model SVC(C=1.0, break_ties=False, cache_size=200, c... "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result_df = pd.DataFrame(evals)\n",
"result_df.drop('Model', axis=0).plot(kind='bar', ylim=(0.7, 0.9)).set_title(\"Base Model Performance\")\n",
"plt.xticks(rotation=0)\n",
"plt.show()\n",
"result_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Base Model Summary\n",
"\n",
"It appears that we have a clear winner in our Random Forest classifier. \n",
"\n",
"## Hyper-parameter Tuning: \n",
"\n",
"Let's tune up our current champion's hyper-parameters in hopes of eking out a little bit more performance. We will use scikit-learn's `RandomizedSearchCV` which has some speed advantages over using an exhaustive `GridSearchCV`. Our first step is to create our grid of parameters over which we will randomly search for the best settings:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" n_estimators : [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]\n",
" max_features : ['auto', 'sqrt']\n",
" max_depth : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None]\n",
" min_samples_split : [2, 5, 10]\n",
" min_samples_leaf : [1, 2, 4]\n",
" bootstrap : [True, False]\n"
]
}
],
"source": [
"# Number of trees in random forest\n",
"n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]\n",
"# Number of features to consider at every split\n",
"max_features = ['auto', 'sqrt']\n",
"# Maximum number of levels in tree\n",
"max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]\n",
"max_depth.append(None)\n",
"# Minimum number of samples required to split a node\n",
"min_samples_split = [2, 5, 10]\n",
"# Minimum number of samples required at each leaf node\n",
"min_samples_leaf = [1, 2, 4]\n",
"# Method of selecting samples for training each tree\n",
"bootstrap = [True, False]\n",
"# Create the random grid\n",
"random_grid = {'n_estimators': n_estimators, \n",
" 'max_features': max_features,\n",
" 'max_depth': max_depth,\n",
" 'min_samples_split': min_samples_split,\n",
" 'min_samples_leaf': min_samples_leaf,\n",
" 'bootstrap': bootstrap}\n",
"\n",
"pprint(random_grid, 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we want to create our `RandomizedSearchCV` object which will use the grid we just created above. It will randomly sample 10 combinations of parameters, test them over 3 folds and return the set of parameters that performed the best on our training data."
]
},
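{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put the random search in perspective, here is a quick count of how much of the grid it actually visits. The list lengths are read off the grid printed above; the variable names are only for illustration.\n",
"\n",
"```python\n",
"from math import prod\n",
"\n",
"# Number of candidate values per hyper-parameter, read off the printed grid.\n",
"grid_sizes = {\n",
"    'n_estimators': 10,\n",
"    'max_features': 2,\n",
"    'max_depth': 12,  # 11 depths plus None\n",
"    'min_samples_split': 3,\n",
"    'min_samples_leaf': 3,\n",
"    'bootstrap': 2,\n",
"}\n",
"\n",
"total_combinations = prod(grid_sizes.values())\n",
"n_iter, cv = 10, 3  # the settings passed to RandomizedSearchCV below\n",
"\n",
"fits_random = n_iter * cv            # cross-validation fits in the random search\n",
"fits_grid = total_combinations * cv  # fits an exhaustive grid search would need\n",
"\n",
"print(total_combinations, fits_random, fits_grid)\n",
"```\n",
"\n",
"With 4,320 possible combinations, the search samples well under 1% of the grid, so the best parameters it reports are a reasonable candidate rather than a guaranteed optimum."
]
},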
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# create RandomizedSearchCV object\n",
"searcher = model_selection.RandomizedSearchCV(estimator = RandomForestClassifier(),\n",
" param_distributions = random_grid,\n",
" n_iter = 10, # Number of parameter settings to sample (this could take a while)\n",
" cv = 3, # Number of folds for k-fold validation \n",
" n_jobs = -1, # Use all processors to compute in parallel\n",
" random_state=0) \n",
"search = searcher.fit(x_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'n_estimators': 1600,\n",
" 'min_samples_split': 10,\n",
" 'min_samples_leaf': 4,\n",
" 'max_features': 'auto',\n",
" 'max_depth': 30,\n",
" 'bootstrap': False}"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"params = search.best_params_\n",
"params"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After performing our parameter tuning, we can verify whether or not the parameters provided by the search actually improve the base model or not. Let's compare the performance of the two models before and after tuning."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Tuned</th>\n",
" <th>Basic</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Accuracy</th>\n",
" <td>0.846902</td>\n",
" <td>0.838501</td>\n",
" </tr>\n",
" <tr>\n",
" <th>F1 Score</th>\n",
" <td>0.836075</td>\n",
" <td>0.826364</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AUC</th>\n",
" <td>0.878691</td>\n",
" <td>0.872901</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Model</th>\n",
" <td>(DecisionTreeClassifier(ccp_alpha=0.0, class_w...</td>\n",
" <td>(DecisionTreeClassifier(ccp_alpha=0.0, class_w...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Tuned \\\n",
"Accuracy 0.846902 \n",
"F1 Score 0.836075 \n",
"AUC 0.878691 \n",
"Model (DecisionTreeClassifier(ccp_alpha=0.0, class_w... \n",
"\n",
" Basic \n",
"Accuracy 0.838501 \n",
"F1 Score 0.826364 \n",
"AUC 0.872901 \n",
"Model (DecisionTreeClassifier(ccp_alpha=0.0, class_w... "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tuning_eval = {}\n",
"tuned_rf = RandomForestClassifier(**params)\n",
"basic_rf = RandomForestClassifier()\n",
"\n",
"tuning_eval['Tuned'] = kfold_evaluate(tuned_rf)\n",
"tuning_eval['Basic'] = kfold_evaluate(basic_rf)\n",
"\n",
"result_df = pd.DataFrame(tuning_eval)\n",
"result_df.drop('Model', axis=0).plot(kind='bar', ylim=(0.7, 0.9)).set_title(\"Tuning Performance\")\n",
"plt.xticks(rotation=0)\n",
"plt.show()\n",
"result_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Final Steps: \n",
"\n",
"Now that we have chosen and tuned a Random Forest classifier, we want to test it on data it has never before seen. This will tell us how we might expect the model to perform in the future, on new data. It's time to use that held out test set. \n",
"\n",
"Then, we will combine the test and training data, and re-fit our model to the combined data set, hopefully giving it the greatest chance of success on the unlabeled data from the competition. \n",
"\n",
"Finally, we will make our predictions on the unlabeled data for submission to the competition. \n",
"\n",
"### Final Test on Held Out Data"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"y_pred = tuned_rf.predict(x_test)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Died: \n",
"\t precision : 0.7815126050420168\n",
"\t recall : 0.8532110091743119\n",
"\t f1-score : 0.8157894736842106\n",
"\t support : 109\n",
" Survived: \n",
"\t precision : 0.7333333333333333\n",
"\t recall : 0.6285714285714286\n",
"\t f1-score : 0.6769230769230768\n",
"\t support : 70\n",
" accuracy : 0.7653631284916201\n",
" macro avg: \n",
"\t precision : 0.757422969187675\n",
"\t recall : 0.7408912188728702\n",
"\t f1-score : 0.7463562753036437\n",
"\t support : 179\n",
" weighted avg: \n",
"\t precision : 0.7626715490665541\n",
"\t recall : 0.7653631284916201\n",
"\t f1-score : 0.761484178861421\n",
"\t support : 179\n"
]
}
],
"source": [
"results = metrics.classification_report(y_test, y_pred,\n",
" labels = [0, 1],\n",
" target_names = ['Died', 'Survived'],\n",
" output_dict = True)\n",
"\n",
"pprint(results, 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we may have experienced some overfitting. Our model's performance on the test data is roughly 7-9% lower across the board. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combine Training and Testing Datasets for Final Model Fit\n",
"\n",
"Now that we have ascertained that our tuned model performs with about 76% accuracy and has an f1-score of 0.74 on new data, we can proceed to train our model on the entire labeled training set. "
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,\n",
" criterion='gini', max_depth=30, max_features='auto',\n",
" max_leaf_nodes=None, max_samples=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=4, min_samples_split=10,\n",
" min_weight_fraction_leaf=0.0, n_estimators=1600,\n",
" n_jobs=None, oob_score=False, random_state=None,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = pd.concat([x_train, x_test], axis=0).sort_index()\n",
"Y = pd.concat([y_train, y_test], axis=0).sort_index()\n",
"tuned_rf.fit(X, Y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Format and Standardize Unlabeled Data\n",
"\n",
"Next we need to transform our unlabeled data in the same manner as when we were formatting our training data. This includes encoding categorical variables, dropping the same features and normalization. This should ensure consistent results on the never before seen competition data. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Title</th>\n",
" <th>FamilySize</th>\n",
" <th>Alone</th>\n",
" <th>LName</th>\n",
" <th>NameLength</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>892</th>\n",
" <td>0.873482</td>\n",
" <td>-0.755929</td>\n",
" <td>0.386231</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.497413</td>\n",
" <td>-1.955941</td>\n",
" <td>-0.713830</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>-1.672873</td>\n",
" <td>-1.153019</td>\n",
" </tr>\n",
" <tr>\n",
" <th>893</th>\n",
" <td>0.873482</td>\n",
" <td>1.322876</td>\n",
" <td>1.371370</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.512278</td>\n",
" <td>-0.231082</td>\n",
" <td>0.207099</td>\n",
" <td>0.105643</td>\n",
" <td>-1.238278</td>\n",
" <td>-1.662873</td>\n",
" <td>0.453521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>894</th>\n",
" <td>-0.315819</td>\n",
" <td>-0.755929</td>\n",
" <td>2.553537</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.464100</td>\n",
" <td>-1.955941</td>\n",
" <td>-0.713830</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>-1.652873</td>\n",
" <td>-0.249340</td>\n",
" </tr>\n",
" <tr>\n",
" <th>895</th>\n",
" <td>0.873482</td>\n",
" <td>-0.755929</td>\n",
" <td>-0.204852</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.482475</td>\n",
" <td>-0.231082</td>\n",
" <td>-0.713830</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>-1.642873</td>\n",
" <td>-1.153019</td>\n",
" </tr>\n",
" <tr>\n",
" <th>896</th>\n",
" <td>0.873482</td>\n",
" <td>1.322876</td>\n",
" <td>-0.598908</td>\n",
" <td>0.619896</td>\n",
" <td>-0.417492</td>\n",
" <td>-0.231082</td>\n",
" <td>0.207099</td>\n",
" <td>0.764728</td>\n",
" <td>-1.238278</td>\n",
" <td>-1.632872</td>\n",
" <td>1.658426</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1305</th>\n",
" <td>0.873482</td>\n",
" <td>-0.755929</td>\n",
" <td>-0.204852</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.493455</td>\n",
" <td>-0.231082</td>\n",
" <td>-0.713830</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>1.797181</td>\n",
" <td>-0.952201</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1306</th>\n",
" <td>-1.505120</td>\n",
" <td>1.322876</td>\n",
" <td>0.740881</td>\n",
" <td>-0.400248</td>\n",
" <td>1.314435</td>\n",
" <td>1.493778</td>\n",
" <td>5.732668</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>1.807181</td>\n",
" <td>0.051886</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1307</th>\n",
" <td>0.873482</td>\n",
" <td>-0.755929</td>\n",
" <td>0.701476</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.507796</td>\n",
" <td>-0.231082</td>\n",
" <td>-0.713830</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>1.817182</td>\n",
" <td>0.051886</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1308</th>\n",
" <td>0.873482</td>\n",
" <td>-0.755929</td>\n",
" <td>-0.204852</td>\n",
" <td>-0.400248</td>\n",
" <td>-0.493455</td>\n",
" <td>-0.231082</td>\n",
" <td>-0.713830</td>\n",
" <td>-0.553443</td>\n",
" <td>0.807573</td>\n",
" <td>0.787165</td>\n",
" <td>-0.851793</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1309</th>\n",
" <td>0.873482</td>\n",
" <td>-0.755929</td>\n",
" <td>-0.204852</td>\n",
" <td>0.619896</td>\n",
" <td>-0.236957</td>\n",
" <td>1.493778</td>\n",
" <td>2.048955</td>\n",
" <td>0.764728</td>\n",
" <td>-1.238278</td>\n",
" <td>1.827182</td>\n",
" <td>-0.349749</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>418 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" Pclass Sex Age Parch Fare Embarked \\\n",
"PassengerId \n",
"892 0.873482 -0.755929 0.386231 -0.400248 -0.497413 -1.955941 \n",
"893 0.873482 1.322876 1.371370 -0.400248 -0.512278 -0.231082 \n",
"894 -0.315819 -0.755929 2.553537 -0.400248 -0.464100 -1.955941 \n",
"895 0.873482 -0.755929 -0.204852 -0.400248 -0.482475 -0.231082 \n",
"896 0.873482 1.322876 -0.598908 0.619896 -0.417492 -0.231082 \n",
"... ... ... ... ... ... ... \n",
"1305 0.873482 -0.755929 -0.204852 -0.400248 -0.493455 -0.231082 \n",
"1306 -1.505120 1.322876 0.740881 -0.400248 1.314435 1.493778 \n",
"1307 0.873482 -0.755929 0.701476 -0.400248 -0.507796 -0.231082 \n",
"1308 0.873482 -0.755929 -0.204852 -0.400248 -0.493455 -0.231082 \n",
"1309 0.873482 -0.755929 -0.204852 0.619896 -0.236957 1.493778 \n",
"\n",
" Title FamilySize Alone LName NameLength \n",
"PassengerId \n",
"892 -0.713830 -0.553443 0.807573 -1.672873 -1.153019 \n",
"893 0.207099 0.105643 -1.238278 -1.662873 0.453521 \n",
"894 -0.713830 -0.553443 0.807573 -1.652873 -0.249340 \n",
"895 -0.713830 -0.553443 0.807573 -1.642873 -1.153019 \n",
"896 0.207099 0.764728 -1.238278 -1.632872 1.658426 \n",
"... ... ... ... ... ... \n",
"1305 -0.713830 -0.553443 0.807573 1.797181 -0.952201 \n",
"1306 5.732668 -0.553443 0.807573 1.807181 0.051886 \n",
"1307 -0.713830 -0.553443 0.807573 1.817182 0.051886 \n",
"1308 -0.713830 -0.553443 0.807573 0.787165 -0.851793 \n",
"1309 2.048955 0.764728 -1.238278 1.827182 -0.349749 \n",
"\n",
"[418 rows x 11 columns]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Feature Engineering:\n",
"test_df['Title'] = test_df.Name.str.extract(r'([A-Za-z]+)\\.')\n",
"test_df['LName'] = test_df.Name.str.extract(r'([A-Za-z]+),')\n",
"test_df['NameLength'] = test_df.Name.apply(len)\n",
"test_df['FamilySize'] = 1 + test_df.SibSp + test_df.Parch\n",
"test_df['Alone'] = test_df.FamilySize.apply(lambda x: 1 if x==1 else 0)\n",
"test_df.Title = test_df.Title.map(title_dict)\n",
"\n",
"# Feature Selection\n",
"test_df = test_df.drop(todrop, axis=1)\n",
"\n",
"# Imputation of missing age and fare data\n",
"test_df.loc[test_df.Age.isna(), 'Age'] = test_df.Age.median()\n",
"test_df.loc[test_df.Fare.isna(), 'Fare'] = test_df.Fare.median()\n",
"\n",
"# encode categorical data\n",
"for i in test_df.columns:\n",
" if test_df[i].dtype == 'object':\n",
" test_df[i], _ = pd.factorize(test_df[i])\n",
" \n",
"# center and scale data \n",
"test_df = pd.DataFrame(pre.scale(test_df), columns=test_df.columns, index=test_df.index)\n",
"\n",
"# ensure columns of unlabeled data are in same order as training data.\n",
"test_df = test_df[x_test.columns]\n",
"test_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Make Final Predictions and Common Sense Check:\n",
"\n",
"Roughly 32 percent of the passengers aboard the Titanic lived. We will do a last, common sense check to see if our algorithm predicts roughly the same distribution of survivals. Since __Survived__ variable with value 1 implies survival, we can simply add all instances of survival and divide by the total number of passengers to get a rough idea of our predicted distribution. \n",
"\n",
"Keep in mind, the competition organizers could have been tricky and given us uneven distributions for training and testing. In that case, this might not work, but I'm assuming they did not. "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"final = tuned_rf.predict(test_df)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.3660287081339713"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final.sum()/len(final)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>893</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>894</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>895</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>896</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>413</th>\n",
" <td>1305</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>414</th>\n",
" <td>1306</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>415</th>\n",
" <td>1307</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>416</th>\n",
" <td>1308</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>417</th>\n",
" <td>1309</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>418 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived\n",
"0 892 0\n",
"1 893 0\n",
"2 894 0\n",
"3 895 0\n",
"4 896 1\n",
".. ... ...\n",
"413 1305 0\n",
"414 1306 1\n",
"415 1307 0\n",
"416 1308 0\n",
"417 1309 1\n",
"\n",
"[418 rows x 2 columns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission = pd.DataFrame({'PassengerId':test_df.index,\n",
" 'Survived':final})\n",
"submission"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"submission.to_csv('submission2.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}