Created
May 15, 2022 23:02
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "<a href=\"https://www.bigdatauniversity.com\"><img src=\"https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png\" width=\"400\" align=\"center\"></a>\n\n<h1 align=\"center\"><font size=\"5\">Classification with Python</font></h1>" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "In this notebook we try to practice all the classification algorithms that we learned in this course.\n\nWe load a dataset using Pandas library, and apply the following algorithms, and find the best one for this specific dataset by accuracy evaluation methods.\n\nLets first load required libraries:" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": "import itertools\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom matplotlib.ticker import NullFormatter\nimport pandas as pd\nimport numpy as np\nimport matplotlib.ticker as ticker\nfrom sklearn import preprocessing\n%matplotlib inline" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### About dataset" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "This dataset is about past loans. The __Loan_train.csv__ data set includes details of 346 customers whose loan are already paid off or defaulted. It includes following fields:\n\n| Field | Description |\n|----------------|---------------------------------------------------------------------------------------|\n| Loan_status | Whether a loan is paid off on in collection |\n| Principal | Basic principal loan amount at the |\n| Terms | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |\n| Effective_date | When the loan got originated and took effects |\n| Due_date | Since it\u2019s one-time payoff schedule, each loan has one single due date |\n| Age | Age of applicant |\n| Education | Education of applicant |\n| Gender | The gender of applicant |" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "--2022-05-15 22:16:56-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv\nResolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196\nConnecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 23101 (23K) [text/csv]\nSaving to: \u2018loan_train.csv\u2019\n\nloan_train.csv 100%[===================>] 22.56K --.-KB/s in 0.001s \n\n2022-05-15 22:16:56 (15.6 MB/s) - \u2018loan_train.csv\u2019 saved [23101/23101]\n\n" | |
} | |
], | |
"source": "!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>Unnamed: 0.1</th>\n <th>loan_status</th>\n <th>Principal</th>\n <th>terms</th>\n <th>effective_date</th>\n <th>due_date</th>\n <th>age</th>\n <th>education</th>\n <th>Gender</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/8/2016</td>\n <td>10/7/2016</td>\n <td>45</td>\n <td>High School or Below</td>\n <td>male</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/8/2016</td>\n <td>10/7/2016</td>\n <td>33</td>\n <td>Bechalor</td>\n <td>female</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>3</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>15</td>\n <td>9/8/2016</td>\n <td>9/22/2016</td>\n <td>27</td>\n <td>college</td>\n <td>male</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>4</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/9/2016</td>\n <td>10/8/2016</td>\n <td>28</td>\n <td>college</td>\n <td>female</td>\n </tr>\n <tr>\n <th>4</th>\n <td>6</td>\n <td>6</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/9/2016</td>\n <td>10/8/2016</td>\n <td>29</td>\n <td>college</td>\n <td>male</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \\\n0 0 0 PAIDOFF 1000 30 9/8/2016 \n1 2 2 PAIDOFF 1000 30 9/8/2016 \n2 3 3 PAIDOFF 1000 15 9/8/2016 \n3 4 4 PAIDOFF 1000 30 9/9/2016 \n4 6 6 PAIDOFF 1000 30 9/9/2016 \n\n due_date age education Gender \n0 10/7/2016 45 High School or Below male \n1 10/7/2016 33 Bechalor female \n2 9/22/2016 27 college male \n3 10/8/2016 28 college female \n4 10/8/2016 29 college male " | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df = pd.read_csv('loan_train.csv')\ndf.head()" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "(346, 10)" | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df.shape" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Convert to date time object " | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>Unnamed: 0.1</th>\n <th>loan_status</th>\n <th>Principal</th>\n <th>terms</th>\n <th>effective_date</th>\n <th>due_date</th>\n <th>age</th>\n <th>education</th>\n <th>Gender</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-08</td>\n <td>2016-10-07</td>\n <td>45</td>\n <td>High School or Below</td>\n <td>male</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-08</td>\n <td>2016-10-07</td>\n <td>33</td>\n <td>Bechalor</td>\n <td>female</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>3</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>15</td>\n <td>2016-09-08</td>\n <td>2016-09-22</td>\n <td>27</td>\n <td>college</td>\n <td>male</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>4</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-09</td>\n <td>2016-10-08</td>\n <td>28</td>\n <td>college</td>\n <td>female</td>\n </tr>\n <tr>\n <th>4</th>\n <td>6</td>\n <td>6</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-09</td>\n <td>2016-10-08</td>\n <td>29</td>\n <td>college</td>\n <td>male</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \\\n0 0 0 PAIDOFF 1000 30 2016-09-08 \n1 2 2 PAIDOFF 1000 30 2016-09-08 \n2 3 3 PAIDOFF 1000 15 2016-09-08 \n3 4 4 PAIDOFF 1000 30 2016-09-09 \n4 6 6 PAIDOFF 1000 30 2016-09-09 \n\n due_date age education Gender \n0 2016-10-07 45 High School or Below male \n1 2016-10-07 33 Bechalor female \n2 2016-09-22 27 college male \n3 2016-10-08 28 college female \n4 2016-10-08 29 college male " | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df['due_date'] = pd.to_datetime(df['due_date'])\ndf['effective_date'] = pd.to_datetime(df['effective_date'])\ndf.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Data visualization and pre-processing" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Let\u2019s see how many of each class is in our data set " | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "PAIDOFF 260\nCOLLECTION 86\nName: loan_status, dtype: int64" | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df['loan_status'].value_counts()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "260 people have paid off the loan on time while 86 have gone into collection \nLets plot some columns to underestand data better:" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Collecting package metadata (current_repodata.json): done\nSolving environment: \\ \nThe environment is inconsistent, please check the package plan carefully\nThe following packages are causing the inconsistency:\n\n - defaults/linux-64::bokeh==2.3.2=py38h06a4308_0\n - defaults/noarch::tensorflow-estimator==2.6.0=pyh2888c4f_2\n - defaults/linux-64::tensorflow-base==2.6.2=h1234567_cpu_py38_pb3.14_1\n - defaults/linux-64::torchtext==0.10.0=py38_1\n - defaults/linux-64::scipy==1.6.2=py38hf56f3a7_1\n - defaults/linux-64::pyarrow==3.0.0=py38hc46b212_16_cpu\n - defaults/noarch::keras-preprocessing==1.1.2=pyhd3eb1b0_0\n - defaults/linux-64::matplotlib==3.3.4=py38h06a4308_0\n - defaults/linux-64::torchvision-base==0.10.0=cpu_py38_1\n - defaults/noarch::seaborn==0.11.1=pyhd3eb1b0_0\n - defaults/noarch::tensorflow-hub==0.12.0=pyh4e80574_pb3.14_5\n - defaults/noarch::ibm-wsrt-py38wnlp-main==0.0.0=0\n - defaults/noarch::opt_einsum==3.3.0=pyhd3eb1b0_1\n - defaults/linux-64::av==8.0.3=py38h968fa86_3\n - defaults/linux-64::grpcio-tools==1.35.0=py38h2531618_1\n - defaults/linux-64::pandas==1.2.4=py38h2531618_0\n - defaults/noarch::keras==2.6.0=pyhddf08d5_1\n - defaults/noarch::tensorboard==2.6.0=pyh80f5e3f_pb3.14_3\n - defaults/linux-64::numpy==1.19.2=py38h6163131_0\n - defaults/linux-64::matplotlib-base==3.3.4=py38h62a2d02_0\n - defaults/linux-64::scikit-learn==1.0.1=py38h51133e4_0\n - defaults/linux-64::arrow-cpp==3.0.0=py38hd6ed86c_16_cpu\n - defaults/linux-64::tensorflow-text==2.6.0=h027b3ee_py38_pb3.14_1\n - defaults/noarch::keras-applications==1.0.8=py_1\n - defaults/linux-64::tensorflow-cpu==2.6.2=py38_1\n - defaults/noarch::jaydebeapi==1.2.3=py_0\n - defaults/linux-64::tensorflow-addons==0.14.0=py38h1fd0ce0_2_cpu\nfailed with initial frozen solve. 
Retrying with flexible solve.\nSolving environment: failed with repodata from current_repodata.json, will retry with next repodata source.\nCollecting package metadata (repodata.json): done\nSolving environment: \\ \nThe environment is inconsistent, please check the package plan carefully\nThe following packages are causing the inconsistency:\n\n - defaults/linux-64::bokeh==2.3.2=py38h06a4308_0\n - defaults/noarch::tensorflow-estimator==2.6.0=pyh2888c4f_2\n - defaults/linux-64::tensorflow-base==2.6.2=h1234567_cpu_py38_pb3.14_1\n - defaults/linux-64::torchtext==0.10.0=py38_1\n - defaults/linux-64::scipy==1.6.2=py38hf56f3a7_1\n - defaults/linux-64::pyarrow==3.0.0=py38hc46b212_16_cpu\n - defaults/noarch::keras-preprocessing==1.1.2=pyhd3eb1b0_0\n - defaults/linux-64::matplotlib==3.3.4=py38h06a4308_0\n - defaults/linux-64::torchvision-base==0.10.0=cpu_py38_1\n - defaults/noarch::seaborn==0.11.1=pyhd3eb1b0_0\n - defaults/noarch::tensorflow-hub==0.12.0=pyh4e80574_pb3.14_5\n - defaults/noarch::ibm-wsrt-py38wnlp-main==0.0.0=0\n - defaults/noarch::opt_einsum==3.3.0=pyhd3eb1b0_1\n - defaults/linux-64::av==8.0.3=py38h968fa86_3\n - defaults/linux-64::grpcio-tools==1.35.0=py38h2531618_1\n - defaults/linux-64::pandas==1.2.4=py38h2531618_0\n - defaults/noarch::keras==2.6.0=pyhddf08d5_1\n - defaults/noarch::tensorboard==2.6.0=pyh80f5e3f_pb3.14_3\n - defaults/linux-64::numpy==1.19.2=py38h6163131_0\n - defaults/linux-64::matplotlib-base==3.3.4=py38h62a2d02_0\n - defaults/linux-64::scikit-learn==1.0.1=py38h51133e4_0\n - defaults/linux-64::arrow-cpp==3.0.0=py38hd6ed86c_16_cpu\n - defaults/linux-64::tensorflow-text==2.6.0=h027b3ee_py38_pb3.14_1\n - defaults/noarch::keras-applications==1.0.8=py_1\n - defaults/linux-64::tensorflow-cpu==2.6.2=py38_1\n - defaults/noarch::jaydebeapi==1.2.3=py_0\n - defaults/linux-64::tensorflow-addons==0.14.0=py38h1fd0ce0_2_cpu\ndone\n\n## Package Plan ##\n\n environment location: /opt/conda/envs/Python-3.8-Watson-NLP\n\n added / updated specs:\n - 
seaborn\n\n\nThe following packages will be downloaded:\n\n package | build\n ---------------------------|-----------------\n ca-certificates-2022.3.29 | h06a4308_1 124 KB anaconda\n certifi-2021.10.8 | py38h06a4308_2 156 KB anaconda\n grpcio-1.35.0 | py38h2157cd5_1 2.0 MB anaconda\n h5py-3.2.1 | py38h6c542dc_0 1.3 MB anaconda\n jpype1-1.2.1 | py38hff7bd54_0 462 KB anaconda\n lxml-4.6.3 | py38h9120a33_0 1.4 MB anaconda\n numpy-base-1.19.2 | py38h75fe3a5_0 5.3 MB anaconda\n openssl-1.1.1n | h7f8727e_0 3.8 MB anaconda\n python-flatbuffers-1.12 | pyhd3eb1b0_0 21 KB anaconda\n werkzeug-2.0.2 | pyhd3eb1b0_0 226 KB anaconda\n ------------------------------------------------------------\n Total: 14.8 MB\n\nThe following NEW packages will be INSTALLED:\n\n grpcio anaconda/linux-64::grpcio-1.35.0-py38h2157cd5_1\n h5py anaconda/linux-64::h5py-3.2.1-py38h6c542dc_0\n jpype1 anaconda/linux-64::jpype1-1.2.1-py38hff7bd54_0\n lxml anaconda/linux-64::lxml-4.6.3-py38h9120a33_0\n numpy-base anaconda/linux-64::numpy-base-1.19.2-py38h75fe3a5_0\n python-flatbuffers anaconda/noarch::python-flatbuffers-1.12-pyhd3eb1b0_0\n pytorch-base opt/ibm/filechannels/wnlp-oce141-py38-x86/open-ce/1.4.1/linux-64::pytorch-base-1.9.0-h1234567_cpu_py38_pb3.14_1\n sentencepiece opt/ibm/filechannels/wnlp-oce141-py38-x86/open-ce/1.4.1/linux-64::sentencepiece-0.1.91-hedccbcc_py38_pb3.14_7\n werkzeug anaconda/noarch::werkzeug-2.0.2-pyhd3eb1b0_0\n\nThe following packages will be SUPERSEDED by a higher-priority channel:\n\n ca-certificates pkgs/main --> anaconda\n certifi pkgs/main --> anaconda\n openssl pkgs/main --> anaconda\n\n\n\nDownloading and Extracting Packages\ncertifi-2021.10.8 | 156 KB | ##################################### | 100% \nnumpy-base-1.19.2 | 5.3 MB | ##################################### | 100% \nwerkzeug-2.0.2 | 226 KB | ##################################### | 100% \nopenssl-1.1.1n | 3.8 MB | ##################################### | 100% \npython-flatbuffers-1 | 21 KB | 
##################################### | 100% \nca-certificates-2022 | 124 KB | ##################################### | 100% \njpype1-1.2.1 | 462 KB | ##################################### | 100% \nlxml-4.6.3 | 1.4 MB | ##################################### | 100% \ngrpcio-1.35.0 | 2.0 MB | ##################################### | 100% \nh5py-3.2.1 | 1.3 MB | ##################################### | 100% \nPreparing transaction: done\nVerifying transaction: done\nExecuting transaction: done\n" | |
} | |
], | |
"source": "# notice: installing seaborn might takes a few minutes\n!conda install -c anaconda seaborn -y" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAADQCAYAAABStPXYAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAbBklEQVR4nO3de5xVdb3/8ddbnBwRzQuTIoQzKpIg/HY6aWZ2EI3wBnEsxcyk4zmkqcUps9CyTj4yE0rzeAtPhI+OoFSKhnmL4BiWF8BRwAveJpwEBOyRkkCAn98fe824Gfcwl71nZs3e7+fjsR57re9el89i9pfP/n7X2uuriMDMzCxtduruAMzMzPJxgjIzs1RygjIzs1RygjIzs1RygjIzs1RygjIzs1RyguokkvaVNFPSy5IWS/qzpHFF2vcISXOLsa+uIGmBpNrujsO6RynVBUlVkh6T9KSkYzvxOBs6a989iRNUJ5AkYA7wcEQcGBFHAOOBAd0Uz87dcVyzEqwLxwPPRcSHI+KPxYjJWuYE1TlGAv+MiJsbCyLiLxHx3wCSekmaIukJSU9L+lJSPiJpbfxa0nOSbksqOJJGJ2ULgX9t3K+k3SRNT/b1pKSxSfkESb+S9FvgwUJORtIMSTdJmp98C/6X5JjPSpqRs95NkhZJWi7pv1rY16jkG/SSJL4+hcRmqVcydUFSBrgaOElSnaRdW/o8S6qXdGXy3iJJh0t6QNJLks5L1ukjaV6y7dLGePMc9xs5/z5561XJighPRZ6ArwDX7OD9icC3k/ldgEVADTAC+DvZb5c7AX8GPg5UAq8CgwABs4G5yfZXAp9P5vcEVgC7AROABmDvFmL4I1CXZzohz7ozgNuTY48F3gSGJTEuBjLJensnr72ABcDwZHkBUAv0BR4GdkvKvwlc3t1/L0+dN5VgXZgAXJ/Mt/h5BuqB85P5a4Cngd2BKuD1pHxnYI+cfb0IKFnekLyOAqYl57oTMBf4RHf/XbtqctdPF5B0A9nK9c+I+AjZD91wSZ9JVnk/2Qr3T+DxiGhItqsDqoENwCsR8UJS/r9kKzbJvsZIujhZrgQGJvMPRcQb+WKKiPb2n/82IkLSUmBNRCxNYlmexFgHnC5pItmK1w8YQrZiNvpoUvZI8mX4fWT/47EyUSJ1oVFrn+d7ktelQJ+IeAt4S9ImSXsC/wCulPQJ4B2gP7AvsDpnH6OS6clkuQ/Zf5+HOxhzj+IE1TmWA6c1LkTEBZL6kv12CNlvQxdFxAO5G0kaAWzOKdrGu3+jlh6aKOC0iHi+2b6OIlsB8m8k/ZHsN7rmLo6I3+cpb4zrnWYxvgPsLKkGuBj4SET8Len6q8wT60MRcWZLcVnJKcW6kHu8HX2ed1hngLPItqiOiIgtkurJX2d+GBE/20EcJcvXoDrHH4BKSefnlPXOmX8AOF9SBYCkQyTttoP9PQfUSDooWc6tEA8AF+X0z3+4LQFGxLERkckz7ahC7sgeZP8T+LukfYET86zzKHCMpIOTWHtLOqSDx7OeoZTrQqGf5/eT7e7bIuk44IA86zwA/FvOta3+kj7QjmP0aE5QnSCyncefBv5F0iuSHgduJdtHDfA/wDPAEknLgJ+xg9ZsRGwi241xb3Jh+C85b18BVABPJ/u6osin0yYR8RTZbojlwHTgkTzrrCXbhz9L0tNkK/iHujBM62KlXBeK8Hm+DaiVtIhsa+q5PMd4EJgJ/DnpXv81+Vt7JanxgpyZmVmquAVlZmap5ARlZmap5ARlZmap5ARlZmaplIoENXr06CD72wZPnkphKirXD08lNrVZKhLUunXrujsEs9Ry/bBylYoEZWZm1pwTlJmZpZITlJmZpZIfFmtmJWXLli00NDSwadOm7g6lrFVWVjJgwAAqKio6vA8nKDMrKQ0NDey+++5UV1eTPDfWulhEsH79ehoaGqipqenwftzFZ2YlZdOmTeyzzz5OTt1IEvvss0/BrVgnKC
sbB/Trh6SCpwP69evuU7FWODl1v2L8DdzFZ2Vj5erVNOw/oOD9DHitoQjRmFlr3IIys5JWrJZze1rQvXr1IpPJcNhhh/HZz36Wt99+G4CtW7fSt29fJk+evN36I0aMYNGi7CDD1dXVDBs2jGHDhjFkyBC+/e1vs3nzuwPyLl++nJEjR3LIIYcwaNAgrrjiChqHTZoxYwZVVVVkMhkymQxf+MIXAJgwYQI1NTVN5dddd11R/m07m1tQZlbSitVybtSWFvSuu+5KXV0dAGeddRY333wzX/va13jwwQcZPHgws2fP5sorr2yxG2z+/Pn07duXDRs2MHHiRCZOnMitt97Kxo0bGTNmDDfddBOjRo3i7bff5rTTTuPGG2/kggsuAOCMM87g+uuvf88+p0yZwmc+85mOn3g3aLUFJWm6pNeTESoby74n6a+S6pLppJz3Jkt6UdLzkj7VWYGbmfUExx57LC+++CIAs2bN4qtf/SoDBw7k0UcfbXXbPn36cPPNNzNnzhzeeOMNZs6cyTHHHMOoUaMA6N27N9dffz1XXXVVp55Dd2lLF98MYHSe8msiIpNMvwOQNAQYDwxNtrlRUq9iBWtm1pNs3bqV++67j2HDhrFx40bmzZvHKaecwplnnsmsWbPatI899tiDmpoaXnjhBZYvX84RRxyx3fsHHXQQGzZs4M033wTgjjvuaOrK+8UvftG03je+8Y2m8qVLlxbvJDtRqwkqIh4G3mjj/sYCt0fE5oh4BXgROLKA+MzMepyNGzeSyWSora1l4MCBnHvuucydO5fjjjuO3r17c9ppp3HXXXexbdu2Nu2v8RpTRLTYLdhYfsYZZ1BXV0ddXR1f/OIXm96fMmVKU/mwYcMKPMOuUcg1qAslfQFYBHw9Iv4G9Ady260NSdl7SJoITAQYOHBgAWGYlR7Xj54t9xpUo1mzZvHII49QXV0NwPr165k/fz4nnHDCDvf11ltvUV9fzyGHHMLQoUN5+OGHt3v/5Zdfpk+fPuy+++7FPIVU6OhdfDcBBwEZYBXw46Q8X2rPO/5HREyLiNqIqK2qqupgGGalyfWjtLz55pssXLiQlStXUl9fT319PTfccEOr3XwbNmzgy1/+Mp/+9KfZa6+9OOuss1i4cCG///3vgWxL7Stf+QqXXHJJV5xGl+tQCyoi1jTOS7oFmJssNgAfzFl1APBah6MzMyvQwP32K+pv1wbut1+7t7nzzjsZOXIku+yyS1PZ2LFjueSSS7a7hbzRcccdR0TwzjvvMG7cOL7zne8A2ZbZ3XffzUUXXcQFF1zAtm3bOPvss7nwwgs7fkIppsa+zR2uJFUDcyPisGS5X0SsSub/EzgqIsZLGgrMJHvdaX9gHjAoInbY0VpbWxuNvwEw6yySivZD3VbqTVEfY+D60T7PPvsshx56aHeHYbT4t2hz/Wi1BSVpFjAC6CupAfguMEJShmz3XT3wJYCIWC5pNvAMsBW4oLXkZGZmlk+rCSoizsxT/PMdrP8D4AeFBGVmZuZHHZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZlZSdt/wMCiDrex/4DWn+yxevVqxo8fz0EHHcSQIUM46aSTWLFiRatDZeT7PVN1dTXr1q3brqz5sBqZTIZnnnkGgBUrVnDSSSdx8MEHc+ihh3L66adv93y+Pn36MHjw4KbhOBYsWMApp5zStO85c+YwfPhwPvShDzFs2DDmzJnT9N6ECRPo379/02+31q1b1/RkjM7g4TbMrKSt+uurHHX5/UXb32Pfz/fs7HdFBOPGjeOcc87h9ttvB6Curo41a9YwYcKEHQ6V0R75htXYtGkTJ598Mj/5yU849dRTgezQHVVVVU2PXhoxYgRTp06ltrYWgAULFjRt/9RTT3HxxRfz0EMPUVNTwyuvvMInP/lJDjzwQIYPHw5kx7qaPn06559/frtjbi+3oMzMimj+/PlUVFRw3nnnNZVlMhlWrFjR6UNlzJw5k6OPPropOU
H2qRSHHXZYm7afOnUql156KTU1NQDU1NQwefJkpkyZ0rTOpEmTuOaaa9i6dWvR4m6JE5SZWREtW7bsPUNiAG0aKqM9crvtMpkMGzdubPHYbZUvxtraWpYvX960PHDgQD7+8Y/zy1/+ssPHaSt38ZmZdYG2DJXRHi2NnFuIfDHmK7v00ksZM2YMJ598clGP35xbUGZmRTR06FAWL16ct7z5MxWLPVRGS8duz/bNY1yyZAlDhgzZruzggw8mk8kwe/bsDh+rLZygzMyKaOTIkWzevJlbbrmlqeyJJ55g0KBBnT5Uxuc+9zn+9Kc/ce+99zaV3X///W0eQffiiy/mhz/8IfX19QDU19dz5ZVX8vWvf/0961522WVMnTq1KHG3xF18ZlbS+vX/YKt33rV3fzsiibvuuotJkyZx1VVXUVlZSXV1Nddee22rQ2XMmDFju9u6H300O/7r8OHD2WmnbHvi9NNPZ/jw4dxxxx0sXLiwad0bb7yRj33sY8ydO5dJkyYxadIkKioqGD58OD/96U/bdG6ZTIYf/ehHnHrqqWzZsoWKigquvvpqMpnMe9YdOnQohx9+OEuWLGnTvjuiTcNtdDYPJ2BdwcNtlAcPt5EehQ630WoXn6Tpkl6XtCynbIqk5yQ9LekuSXsm5dWSNkqqS6ab2xqImZlZrrZcg5oBNG8fPwQcFhHDgRXA5Jz3XoqITDKdh5mZWQe0mqAi4mHgjWZlD0ZE46+0HiU7tLuZWSqk4dJFuSvG36AYd/H9G3BfznKNpCcl/Z+kY1vaSNJESYskLVq7dm0RwjArHa4fHVdZWcn69eudpLpRRLB+/XoqKysL2k9Bd/FJuozs0O63JUWrgIERsV7SEcAcSUMj4j0/k46IacA0yF4ELiQOs1Lj+tFxAwYMoKGhASf27lVZWcmAAYV1rnU4QUk6BzgFOD6SryoRsRnYnMwvlvQScAjgW5DMrEtUVFQ0PUvOerYOdfFJGg18ExgTEW/nlFdJ6pXMHwgMAl4uRqBmZlZeWm1BSZoFjAD6SmoAvkv2rr1dgIeSZzQ9mtyx9wng+5K2AtuA8yLijbw7NjMz24FWE1REnJmn+OctrPsb4DeFBmVmZuZn8ZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSo5QZmZWSq1mqAkTZf0uqRlOWV7S3pI0gvJ6145702W9KKk5yV9qrMCNzOz0taWFtQMYHSzsm8B8yJiEDAvWUbSEGA8MDTZ5sbGEXbNzMzao9UEFREPA81HxR0L3JrM3wp8Oqf89ojYHBGvAC8CRxYnVDMzKycdvQa1b0SsAkheP5CU9wdezVmvISl7D0kTJS2StGjt2rUdDMOsNLl+mBX/JgnlKYt8K0bEtIiojYjaqqqqIodh1rO5fph1PEGtkdQPIHl9PSlvAD6Ys94A4LWOh2dmZuWqownqHuCcZP4c4O6c8vGSdpFUAwwCHi8sRDMzK0c7t7aCpFnACKCvpAbgu8BVwGxJ5wIrgc8CRMRySbOBZ4CtwAURsa2TYjczsxLWaoKKiDNbeOv4Ftb/AfCDQoIyMzPzkyTMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVnKDMzCyVWn2aeUskDQbuyCk6ELgc2BP4D6BxnOpLI+J3HT2OmZmVpw4nqIh4HsgASOoF/BW4C/gicE1ETC1GgGZmVp6K1cV3PPBSRPylSPszM7MyV6wENR6YlbN8oaSnJU2XtFe+DSRNlLRI0qK1a9fmW8WsbLl+mBUhQUl6HzAG+FVSdBNwENnuv1XAj/NtFxHTIqI2ImqrqqoKDcOspLh+mBWnBXUisCQi1gBExJqI2BYR7wC3AEcW4RhmZlZmipGgziSne09Sv5z3xg
HLinAMMzMrMx2+iw9AUm/gk8CXcoqvlpQBAqhv9p6ZmVmbFJSgIuJtYJ9mZWcXFJGZmRl+koSZmaWUE5SZmaWSE5SZmaWSE5SZmaWSE5SZmaWSE5SZmaVSQbeZm/Uk6lXBgNcairIfM+t8TlBWNmLbFo66/P6C9/PY90cXIRoza427+MzMLJWcoMzMLJWcoMzMLJWcoMzMLJWcoMzMLJWcoMzMLJUKHQ+qHngL2AZsjYhaSXsDdwDVZMeDOj0i/lZYmGZmVm6K0YI6LiIyEVGbLH8LmBcRg4B5ybKVoQP69UNSwdMB/fq1fjAzKzmd8UPdscCIZP5WYAHwzU44jqXcytWradh/QMH7KcbTH8ys5ym0BRXAg5IWS5qYlO0bEasAktcP5NtQ0kRJiyQtWrt2bYFhmJUW1w+zwhPUMRFxOHAicIGkT7R1w4iYFhG1EVFbVVVVYBhmpcX1w6zABBURryWvrwN3AUcCayT1A0heXy80SDMzKz8dTlCSdpO0e+M8MApYBtwDnJOsdg5wd6FBmplZ+SnkJol9gbskNe5nZkTcL+kJYLakc4GVwGcLD9PMzMpNhxNURLwM/L885euB4wsJyszMzE+SMDOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjOzVHKCMjMrc2kd/bozRtQ1M7MeJK2jX7sFZWZmqVTIeFAflDRf0rOSlkv6alL+PUl/lVSXTCcVL1wzMysXhXTxbQW+HhFLkoELF0t6KHnvmoiYWnh4ZmZWrgoZD2oVsCqZf0vSs0D/YgVmZmblrSjXoCRVAx8GHkuKLpT0tKTpkvZqYZuJkhZJWrR27dpihGFWMlw/zIqQoCT1AX4DTIqIN4GbgIOADNkW1o/zbRcR0yKiNiJqq6qqCg3DrKS4fpgVmKAkVZBNTrdFxJ0AEbEmIrZFxDvALcCRhYdpZmblppC7+AT8HHg2In6SU577S61xwLKOh2dmZuWqkLv4jgHOBpZKqkvKLgXOlJQBAqgHvlTAMczMrEwVchffQkB53vpdx8MxMzPL8pMkzMwslfwsPus06lVRlGdzqVdFEaIxs57GCco6TWzbwlGX31/wfh77/ugiRGNmPY27+MzMLJWcoMzMLJWcoMzMLJWcoMzMLJWcoMzMulhah1hPG9/FZ2bWxdI6xHrauAVlZmap5ARlZmap5C4+M7Myl9anvjhBmZmVubQ+9cVdfGZmlkqdlqAkjZb0vKQXJX2r0P35tkwzs/LSKV18knoBNwCfBBqAJyTdExHPdHSfvi3TzKy8dNY1qCOBFyPiZQBJtwNjgQ4nqLQ5oF8/Vq5eXfB+Bu63H39ZtaoIEZU2Kd/YmJZGrhutK9ZNCTv1qijpuqGIKP5Opc8AoyPi35Pls4GjIuLCnHUmAhOTxcHA80UPpO36Auu68fiF6Kmx99S4ofXY10VEQVeLU1Q/SvnvlGY9Nfa2xN3m+tFZLah8KX27TBgR04BpnXT8dpG0KCJquzuOjuipsffUuKFrYk9L/fDfqXv01NiLHXdn3STRAHwwZ3kA8FonHcvMzEpQZyWoJ4BBkmokvQ8YD9zTSccyM7MS1CldfBGxVdKFwANAL2B6RCzvjGMVSbd3pRSgp8beU+OGnh17e/Xkc3XsXa+ocXfKTRJmZmaF8pMkzMwslZygzMwslcomQUnqJelJSXOT5b0lPSTpheR1r5x1JyePaHpe0qe6L2qQtKekX0t6TtKzko7uCbFL+k9JyyUtkzRLUmVa45Y0XdLrkpbllLU7VklHSFqavHedesgvKF03uiV214+21I+IKIsJ+BowE5ibLF8NfCuZ/xbwo2R+CPAUsAtQA7wE9OrGuG8F/j2Zfx+wZ9pjB/oDrwC7JsuzgQlpjRv4BHA4sCynrN2xAo8DR5P9HeB9wInd9blp5/m7bnRt3K4fbawf3V45uugfeAAwDxiZUwmfB/ol8/2A55P5yc
DknG0fAI7uprj3SD7Ialae6tiTCvgqsDfZO0XnAqPSHDdQ3awCtivWZJ3ncsrPBH7WHZ+bdp6360bXx+760cb6US5dfNcClwDv5JTtGxGrAJLXDyTljR+eRg1JWXc4EFgL/CLpgvkfSbuR8tgj4q/AVGAlsAr4e0Q8SMrjbqa9sfZP5puXp921uG50KdeP7cp3qOQTlKRTgNcjYnFbN8lT1l334u9Mtml9U0R8GPgH2eZ0S1IRe9IfPZZsE39/YDdJn9/RJnnK0vr7h5Zi7UnnALhu0E2xu35sV75DJZ+ggGOAMZLqgduBkZL+F1gjqR9A8vp6sn6aHtPUADRExGPJ8q/JVsq0x34C8EpErI2ILcCdwMdIf9y52htrQzLfvDzNXDe6h+tHG8+h5BNUREyOiAERUU32kUt/iIjPk3300jnJaucAdyfz9wDjJe0iqQYYRPbiXpeLiNXAq5IGJ0XHkx2yJO2xrwQ+Kql3cqfO8cCzpD/uXO2KNenmeEvSR5Nz/kLONqnkutFtnzHXj7bWj+64SNhdEzCCdy8E70P24vALyeveOetdRvbuk+fp5juxgAywCHgamAPs1RNiB/4LeA5YBvyS7F09qYwbmEX2WsAWst/0zu1IrEBtcr4vAdfT7AJ+mifXjS6P3fWjDfXDjzoyM7NUKvkuPjMz65mcoMzMLJWcoMzMLJWcoMzMLJWcoMzMLJWcoFJM0jZJdckTj38lqXcL6/2pg/uvlXRdAfFt6Oi2ZoVw3SgPvs08xSRtiIg+yfxtwOKI+EnO+70iYlsa4jPrSq4b5cEtqJ7jj8DBkkZImi9pJrAU3v22lry3QO+OkXNb45grkj4i6U+SnpL0uKTdk/UbxwD6nqRfSvpDMsbLfyTlfSTNk7QkGctlbPecvlmLXDdK1M7dHYC1TtLOwInA/UnRkcBhEfFKntU/DAwl+5yrR4BjJD0O3AGcERFPSNoD2Jhn2+HAR4HdgCcl3Uv2GVvjIuJNSX2BRyXdE256Wwq4bpQ2t6DSbVdJdWQf57IS+HlS/ngLFbDxvYaIeAeoIzuOy2BgVUQ8ARARb0bE1jzb3h0RGyNiHTCfbGUXcKWkp4Hfk31E/r7FODmzArhulAG3oNJtY0RkcguSXol/7GCbzTnz28j+jUXbHs/ffJ0AzgKqgCMiYouyT76ubMO+zDqT60YZcAuqPDwH7C/pIwBJH3u+LydjJVVK2ofsw0OfAN5PdsygLZKOAw7oqqDNuoDrRoq5BVUGIuKfks4A/lvSrmT72E/Is+rjwL3AQOCKiHgtuUPqt5IWke0Wea6LwjbrdK4b6ebbzA3I3qkEbIiIqd0di1mauG50H3fxmZlZKrkFZWZmqeQWlJmZpZITlJmZpZITlJmZpZITlJmZpZITlJmZpdL/B7A+/1urYJiLAAAAAElFTkSuQmCC\n", | |
"text/plain": "<Figure size 432x216 with 2 Axes>" | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": "import seaborn as sns\n\nbins = np.linspace(df.Principal.min(), df.Principal.max(), 10)\ng = sns.FacetGrid(df, col=\"Gender\", hue=\"loan_status\", palette=\"Set1\", col_wrap=2)\ng.map(plt.hist, 'Principal', bins=bins, ec=\"k\")\n\ng.axes[-1].legend()\nplt.show()" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": "<Figure size 432x216 with 2 Axes>" | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": "bins = np.linspace(df.age.min(), df.age.max(), 10)\ng = sns.FacetGrid(df, col=\"Gender\", hue=\"loan_status\", palette=\"Set1\", col_wrap=2)\ng.map(plt.hist, 'age', bins=bins, ec=\"k\")\n\ng.axes[-1].legend()\nplt.show()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Pre-processing: Feature selection/extraction\n\n### Lets look at the day of the week people get the loan " | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": "<Figure size 432x216 with 2 Axes>" | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": "df['dayofweek'] = df['effective_date'].dt.dayofweek\nbins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)\ng = sns.FacetGrid(df, col=\"Gender\", hue=\"loan_status\", palette=\"Set1\", col_wrap=2)\ng.map(plt.hist, 'dayofweek', bins=bins, ec=\"k\")\ng.axes[-1].legend()\nplt.show()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4 " | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>Unnamed: 0.1</th>\n <th>loan_status</th>\n <th>Principal</th>\n <th>terms</th>\n <th>effective_date</th>\n <th>due_date</th>\n <th>age</th>\n <th>education</th>\n <th>Gender</th>\n <th>dayofweek</th>\n <th>weekend</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-08</td>\n <td>2016-10-07</td>\n <td>45</td>\n <td>High School or Below</td>\n <td>male</td>\n <td>3</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-08</td>\n <td>2016-10-07</td>\n <td>33</td>\n <td>Bechalor</td>\n <td>female</td>\n <td>3</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>3</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>15</td>\n <td>2016-09-08</td>\n <td>2016-09-22</td>\n <td>27</td>\n <td>college</td>\n <td>male</td>\n <td>3</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>4</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-09</td>\n <td>2016-10-08</td>\n <td>28</td>\n <td>college</td>\n <td>female</td>\n <td>4</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>6</td>\n <td>6</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-09</td>\n <td>2016-10-08</td>\n <td>29</td>\n <td>college</td>\n <td>male</td>\n <td>4</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \\\n0 0 0 PAIDOFF 1000 30 2016-09-08 \n1 2 2 PAIDOFF 1000 30 2016-09-08 \n2 3 3 PAIDOFF 1000 15 2016-09-08 \n3 4 4 PAIDOFF 1000 30 2016-09-09 \n4 6 6 PAIDOFF 1000 30 2016-09-09 \n\n due_date age education Gender dayofweek weekend \n0 2016-10-07 45 High School or Below male 3 0 \n1 2016-10-07 33 Bechalor female 3 0 \n2 2016-09-22 27 college male 3 0 \n3 2016-10-08 28 college female 4 1 \n4 2016-10-08 29 college male 4 1 " | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)\ndf.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Convert Categorical features to numerical values\n\nLets look at gender:" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "Gender loan_status\nfemale PAIDOFF 0.865385\n COLLECTION 0.134615\nmale PAIDOFF 0.731293\n COLLECTION 0.268707\nName: loan_status, dtype: float64" | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>Unnamed: 0.1</th>\n <th>loan_status</th>\n <th>Principal</th>\n <th>terms</th>\n <th>effective_date</th>\n <th>due_date</th>\n <th>age</th>\n <th>education</th>\n <th>Gender</th>\n <th>dayofweek</th>\n <th>weekend</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>0</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-08</td>\n <td>2016-10-07</td>\n <td>45</td>\n <td>High School or Below</td>\n <td>0</td>\n <td>3</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-08</td>\n <td>2016-10-07</td>\n <td>33</td>\n <td>Bechalor</td>\n <td>1</td>\n <td>3</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>3</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>15</td>\n <td>2016-09-08</td>\n <td>2016-09-22</td>\n <td>27</td>\n <td>college</td>\n <td>0</td>\n <td>3</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>4</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-09</td>\n <td>2016-10-08</td>\n <td>28</td>\n <td>college</td>\n <td>1</td>\n <td>4</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>6</td>\n <td>6</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>2016-09-09</td>\n <td>2016-10-08</td>\n <td>29</td>\n <td>college</td>\n <td>0</td>\n <td>4</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \\\n0 0 0 PAIDOFF 1000 30 2016-09-08 \n1 2 2 PAIDOFF 1000 30 2016-09-08 \n2 3 3 PAIDOFF 1000 15 2016-09-08 \n3 4 4 PAIDOFF 1000 30 2016-09-09 \n4 6 6 PAIDOFF 1000 30 2016-09-09 \n\n due_date age education Gender dayofweek weekend \n0 2016-10-07 45 High School or Below 0 3 0 \n1 2016-10-07 33 Bechalor 1 3 0 \n2 2016-09-22 27 college 0 3 0 \n3 2016-10-08 28 college 1 4 1 \n4 2016-10-08 29 college 0 4 1 " | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)\ndf.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## One Hot Encoding \n#### How about education?" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "education loan_status\nBechalor PAIDOFF 0.750000\n COLLECTION 0.250000\nHigh School or Below PAIDOFF 0.741722\n COLLECTION 0.258278\nMaster or Above COLLECTION 0.500000\n PAIDOFF 0.500000\ncollege PAIDOFF 0.765101\n COLLECTION 0.234899\nName: loan_status, dtype: float64" | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df.groupby(['education'])['loan_status'].value_counts(normalize=True)" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "#### Feature befor One Hot Encoding" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Principal</th>\n <th>terms</th>\n <th>age</th>\n <th>Gender</th>\n <th>education</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1000</td>\n <td>30</td>\n <td>45</td>\n <td>0</td>\n <td>High School or Below</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1000</td>\n <td>30</td>\n <td>33</td>\n <td>1</td>\n <td>Bechalor</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1000</td>\n <td>15</td>\n <td>27</td>\n <td>0</td>\n <td>college</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1000</td>\n <td>30</td>\n <td>28</td>\n <td>1</td>\n <td>college</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1000</td>\n <td>30</td>\n <td>29</td>\n <td>0</td>\n <td>college</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Principal terms age Gender education\n0 1000 30 45 0 High School or Below\n1 1000 30 33 1 Bechalor\n2 1000 15 27 0 college\n3 1000 30 28 1 college\n4 1000 30 29 0 college" | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "df[['Principal','terms','age','Gender','education']].head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "#### Use one hot encoding technique to conver categorical varables to binary variables and append them to the feature Data Frame " | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Principal</th>\n <th>terms</th>\n <th>age</th>\n <th>Gender</th>\n <th>weekend</th>\n <th>Bechalor</th>\n <th>High School or Below</th>\n <th>college</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1000</td>\n <td>30</td>\n <td>45</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1000</td>\n <td>30</td>\n <td>33</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1000</td>\n <td>15</td>\n <td>27</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1000</td>\n <td>30</td>\n <td>28</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1000</td>\n <td>30</td>\n <td>29</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Principal terms age Gender weekend Bechalor High School or Below \\\n0 1000 30 45 0 0 0 1 \n1 1000 30 33 1 0 1 0 \n2 1000 15 27 0 0 0 0 \n3 1000 30 28 1 1 0 0 \n4 1000 30 29 0 1 0 0 \n\n college \n0 0 \n1 0 \n2 1 \n3 1 \n4 1 " | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "Feature = df[['Principal','terms','age','Gender','weekend']]\nFeature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)\nFeature.drop(['Master or Above'], axis = 1,inplace=True)\nFeature.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Feature selection\n\nLets defind feature sets, X:" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Principal</th>\n <th>terms</th>\n <th>age</th>\n <th>Gender</th>\n <th>weekend</th>\n <th>Bechalor</th>\n <th>High School or Below</th>\n <th>college</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1000</td>\n <td>30</td>\n <td>45</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1000</td>\n <td>30</td>\n <td>33</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1000</td>\n <td>15</td>\n <td>27</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1000</td>\n <td>30</td>\n <td>28</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1000</td>\n <td>30</td>\n <td>29</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Principal terms age Gender weekend Bechalor High School or Below \\\n0 1000 30 45 0 0 0 1 \n1 1000 30 33 1 0 1 0 \n2 1000 15 27 0 0 0 0 \n3 1000 30 28 1 1 0 0 \n4 1000 30 29 0 1 0 0 \n\n college \n0 0 \n1 0 \n2 1 \n3 1 \n4 1 " | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "X = Feature\nX[0:5]" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "What are our lables?" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],\n dtype=object)" | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "y = df['loan_status'].values\ny[0:5]" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Normalize Data \n\nData Standardization give data zero mean and unit variance (technically should be done after train test split )" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "array([[ 0.51578458, 0.92071769, 2.33152555, -0.42056004, -1.20577805,\n -0.38170062, 1.13639374, -0.86968108],\n [ 0.51578458, 0.92071769, 0.34170148, 2.37778177, -1.20577805,\n 2.61985426, -0.87997669, -0.86968108],\n [ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805,\n -0.38170062, -0.87997669, 1.14984679],\n [ 0.51578458, 0.92071769, -0.48739188, 2.37778177, 0.82934003,\n -0.38170062, -0.87997669, 1.14984679],\n [ 0.51578458, 0.92071769, -0.3215732 , -0.42056004, 0.82934003,\n -0.38170062, -0.87997669, 1.14984679]])" | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "X= preprocessing.StandardScaler().fit(X).transform(X)\nX[0:5]" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Classification \n\nNow, it is your turn, use the training set to build an accurate model. Then use the test set to report the accuracy of the model\nYou should use the following algorithm:\n- K Nearest Neighbor(KNN)\n- Decision Tree\n- Support Vector Machine\n- Logistic Regression\n\n\n\n__ Notice:__ \n- You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.\n- You should use either scikit-learn, Scipy or Numpy libraries for developing the classification algorithms.\n- You should include the code of the algorithm in the following cells." | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Train set: (276, 8) (276,)\nTest set: (70, 8) (70,)\n" | |
} | |
], | |
"source": "from sklearn.model_selection import train_test_split\nx_train, x_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\nprint ('Train set:', x_train.shape, y_train.shape)\nprint ('Test set:', x_test.shape, y_test.shape)" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# K Nearest Neighbor(KNN)\nNotice: You should find the best k to build the model with the best accuracy. \n**warning:** You should not use the __loan_test.csv__ for finding the best k, however, you can split your train_loan.csv into train and test to find the best __k__." | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [], | |
"source": "from sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.metrics import accuracy_score" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Checking for the best value of K" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "For K = 1 accuracy = 0.6428571428571429\nFor K = 2 accuracy = 0.5857142857142857\nFor K = 3 accuracy = 0.7428571428571429\nFor K = 4 accuracy = 0.7\nFor K = 5 accuracy = 0.7428571428571429\nFor K = 6 accuracy = 0.7142857142857143\nFor K = 7 accuracy = 0.8\nFor K = 8 accuracy = 0.7571428571428571\nFor K = 9 accuracy = 0.7428571428571429\n" | |
} | |
], | |
"source": "for k in range(1, 10):\n knn_model = KNeighborsClassifier(n_neighbors = k).fit(x_train, y_train)\n knn_yhat = knn_model.predict(x_test)\n print(\"For K = {} accuracy = {}\".format(k,accuracy_score(y_test,knn_yhat)))" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "We can see that the KNN model is the best for K=7\n" | |
} | |
], | |
"source": "print(\"We can see that the KNN model is the best for K=7\")" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Building the model with the best value of K = 7" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "KNeighborsClassifier(n_neighbors=7)" | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "best_knn_model = KNeighborsClassifier(n_neighbors = 7).fit(x_train, y_train)\nbest_knn_model" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Train set Accuracy (Jaccard): 0.7759336099585062\nTest set Accuracy (Jaccard): 0.7741935483870968\nTrain set Accuracy (F1): 0.7942614463042823\nTest set Accuracy (F1): 0.8\n" | |
} | |
], | |
"source": "## Evaluation Metrics\n# jaccard score and f1 score\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\n\nprint(\"Train set Accuracy (Jaccard): \", jaccard_score(y_train, best_knn_model.predict(x_train),pos_label = \"PAIDOFF\"))\nprint(\"Test set Accuracy (Jaccard): \", jaccard_score(y_test, best_knn_model.predict(x_test),pos_label = \"PAIDOFF\"))\n\nprint(\"Train set Accuracy (F1): \", f1_score(y_train, best_knn_model.predict(x_train), average='weighted'))\nprint(\"Test set Accuracy (F1): \", f1_score(y_test, best_knn_model.predict(x_test), average='weighted'))" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Decision Tree" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# importing libraries\nfrom sklearn.tree import DecisionTreeClassifier" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "For depth = 1 the accuracy score is 0.7857142857142857 \nFor depth = 2 the accuracy score is 0.7857142857142857 \nFor depth = 3 the accuracy score is 0.6142857142857143 \nFor depth = 4 the accuracy score is 0.6142857142857143 \nFor depth = 5 the accuracy score is 0.6428571428571429 \nFor depth = 6 the accuracy score is 0.7714285714285715 \nFor depth = 7 the accuracy score is 0.7571428571428571 \nFor depth = 8 the accuracy score is 0.7571428571428571 \nFor depth = 9 the accuracy score is 0.6571428571428571 \n" | |
} | |
], | |
"source": "for d in range(1,10):\n dt = DecisionTreeClassifier(criterion = 'entropy', max_depth = d).fit(x_train, y_train)\n dt_yhat = dt.predict(x_test)\n print(\"For depth = {} the accuracy score is {} \".format(d, accuracy_score(y_test, dt_yhat)))" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "The best value of depth is d = 2 \n" | |
} | |
], | |
"source": "print(\"The best value of depth is d = 2 \")" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "DecisionTreeClassifier(criterion='entropy', max_depth=2)" | |
}, | |
"execution_count": 46, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "## Creating the best model for decision tree with best value of depth 2\nbest_dt_model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 2).fit(x_train, y_train)\nbest_dt_model" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 66, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Train set Accuracy (Jaccard): 1 3\nTest set Accuracy (Jaccard): 0.7857142857142857\nTrain set Accuracy (F1): 0.6331163939859591\nTest set Accuracy (F1): 0.6914285714285714\n" | |
} | |
], | |
"source": "## Evaluation Metrics\n# jaccard score and f1 score\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\n\nprint(\"Train set Accuracy (Jaccard): \", jaccard_score(y_train, best_dt_model.predict(x_train), pos_label = \"PAIDOFF\"))\nprint(\"Test set Accuracy (Jaccard): \", jaccard_score(y_test, best_dt_model.predict(x_test), pos_label = \"PAIDOFF\"))\n\nprint(\"Train set Accuracy (F1): \", f1_score(y_train, best_dt_model.predict(x_train), average='weighted'))\nprint(\"Test set Accuracy (F1): \", f1_score(y_test, best_dt_model.predict(x_test), average='weighted'))" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Support Vector Machine" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [], | |
"source": "#importing svm\nfrom sklearn import svm \nfrom sklearn.metrics import f1_score" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "For kernel: linear, the f1 score is: 0.6914285714285714\nFor kernel: poly, the f1 score is: 0.7064793130366899\nFor kernel: rbf, the f1 score is: 0.7275882012724117\nFor kernel: sigmoid, the f1 score is: 0.6892857142857144\n" | |
} | |
], | |
"source": "for k in ('linear', 'poly', 'rbf','sigmoid'):\n svm_model = svm.SVC( kernel = k).fit(x_train,y_train)\n svm_yhat = svm_model.predict(x_test)\n print(\"For kernel: {}, the f1 score is: {}\".format(k,f1_score(y_test,svm_yhat, average='weighted')))" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "We can see the rbf has the best f1 score \n" | |
} | |
], | |
"source": "print(\"We can see the rbf has the best f1 score \")" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "SVC()" | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "## building best SVM with kernel = rbf\nbest_svm = svm.SVC(kernel='rbf').fit(x_train,y_train)\nbest_svm" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Train set Accuracy (Jaccard): 0.7560975609756098\nTest set Accuracy (Jaccard): 0.7272727272727273\nTrain set Accuracy (F1): 0.7682165861513688\nTest set Accuracy (F1): 0.7275882012724117\n" | |
} | |
], | |
"source": "## Evaluation Metrics\n# jaccard score and f1 score\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\n\nprint(\"Train set Accuracy (Jaccard): \", jaccard_score(y_train, best_svm.predict(x_train), pos_label = \"PAIDOFF\"))\nprint(\"Test set Accuracy (Jaccard): \", jaccard_score(y_test, best_svm.predict(x_test), pos_label = \"PAIDOFF\"))\n\nprint(\"Train set Accuracy (F1): \", f1_score(y_train, best_svm.predict(x_train), average='weighted'))\nprint(\"Test set Accuracy (F1): \", f1_score(y_test, best_svm.predict(x_test), average='weighted'))" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Logistic Regression" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# importing libraries\nfrom sklearn.linear_model import LogisticRegression \nfrom sklearn.metrics import log_loss" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 54, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "When Solver is lbfgs, logloss is : 0.4920179847937498\nWhen Solver is saga, logloss is : 0.4920163309725047\nWhen Solver is liblinear, logloss is : 0.5772287609479654\nWhen Solver is newton-cg, logloss is : 0.492017801467927\nWhen Solver is sag, logloss is : 0.49201407210049053\n" | |
} | |
], | |
"source": "for k in ('lbfgs', 'saga', 'liblinear', 'newton-cg', 'sag'):\n lr_model = LogisticRegression(C = 0.01, solver = k).fit(x_train, y_train)\n lr_yhat = lr_model.predict(x_test)\n y_prob = lr_model.predict_proba(x_test)\n print('When Solver is {}, logloss is : {}'.format(k, log_loss(y_test, y_prob)))" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "We can see that the best solver is liblinear\n" | |
} | |
], | |
"source": "print(\"We can see that the best solver is liblinear\")" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "LogisticRegression(C=0.01, solver='liblinear')" | |
}, | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# Best logistic regression model with liblinear solver\n\nbest_lr_model = LogisticRegression(C = 0.01, solver = 'liblinear').fit(x_train, y_train)\nbest_lr_model" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "Train set Accuracy (Jaccard): 0.7351778656126482\nTest set Accuracy (Jaccard): 0.6764705882352942\nTrain set Accuracy (F1): 0.7341146337750953\nTest set Accuracy (F1): 0.6670522459996144\n" | |
} | |
], | |
"source": "## Evaluation Metrics\n# jaccard score and f1 score\nfrom sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\n\nprint(\"Train set Accuracy (Jaccard): \", jaccard_score(y_train, best_lr_model.predict(x_train), pos_label = \"PAIDOFF\"))\nprint(\"Test set Accuracy (Jaccard): \", jaccard_score(y_test, best_lr_model.predict(x_test), pos_label = \"PAIDOFF\"))\n\nprint(\"Train set Accuracy (F1): \", f1_score(y_train, best_lr_model.predict(x_train), average='weighted'))\nprint(\"Test set Accuracy (F1): \", f1_score(y_test, best_lr_model.predict(x_test), average='weighted'))" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Model Evaluation using Test set" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": {}, | |
"outputs": [], | |
"source": "from sklearn.metrics import jaccard_score\nfrom sklearn.metrics import f1_score\nfrom sklearn.metrics import log_loss" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "--2022-05-15 22:43:14-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv\nResolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196\nConnecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 3642 (3.6K) [text/csv]\nSaving to: \u2018loan_test.csv\u2019\n\nloan_test.csv 100%[===================>] 3.56K --.-KB/s in 0s \n\n2022-05-15 22:43:14 (76.4 MB/s) - \u2018loan_test.csv\u2019 saved [3642/3642]\n\n" | |
} | |
], | |
"source": "!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>Unnamed: 0.1</th>\n <th>loan_status</th>\n <th>Principal</th>\n <th>terms</th>\n <th>effective_date</th>\n <th>due_date</th>\n <th>age</th>\n <th>education</th>\n <th>Gender</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>1</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/8/2016</td>\n <td>10/7/2016</td>\n <td>50</td>\n <td>Bechalor</td>\n <td>female</td>\n </tr>\n <tr>\n <th>1</th>\n <td>5</td>\n <td>5</td>\n <td>PAIDOFF</td>\n <td>300</td>\n <td>7</td>\n <td>9/9/2016</td>\n <td>9/15/2016</td>\n <td>35</td>\n <td>Master or Above</td>\n <td>male</td>\n </tr>\n <tr>\n <th>2</th>\n <td>21</td>\n <td>21</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/10/2016</td>\n <td>10/9/2016</td>\n <td>43</td>\n <td>High School or Below</td>\n <td>female</td>\n </tr>\n <tr>\n <th>3</th>\n <td>24</td>\n <td>24</td>\n <td>PAIDOFF</td>\n <td>1000</td>\n <td>30</td>\n <td>9/10/2016</td>\n <td>10/9/2016</td>\n <td>26</td>\n <td>college</td>\n <td>male</td>\n </tr>\n <tr>\n <th>4</th>\n <td>35</td>\n <td>35</td>\n <td>PAIDOFF</td>\n <td>800</td>\n <td>15</td>\n <td>9/11/2016</td>\n <td>9/25/2016</td>\n <td>29</td>\n <td>Bechalor</td>\n <td>male</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \\\n0 1 1 PAIDOFF 1000 30 9/8/2016 \n1 5 5 PAIDOFF 300 7 9/9/2016 \n2 21 21 PAIDOFF 1000 30 9/10/2016 \n3 24 24 PAIDOFF 1000 30 9/10/2016 \n4 35 35 PAIDOFF 800 15 9/11/2016 \n\n due_date age education Gender \n0 10/7/2016 50 Bechalor female \n1 9/15/2016 35 Master or Above male \n2 10/9/2016 43 High School or Below female \n3 10/9/2016 26 college male \n4 9/25/2016 29 Bechalor male " | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "test_df = pd.read_csv('loan_test.csv')\ntest_df.head()" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 68, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# data processing\ntest_df['due_date'] = pd.to_datetime(test_df['due_date'])\ntest_df['effective_date'] = pd.to_datetime(test_df['effective_date'])\ntest_df['dayofweek'] = test_df['effective_date'].dt.dayofweek\n\ntest_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)\ntest_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)\n\nFeature1 = test_df[['Principal','terms','age','Gender','weekend']]\nFeature1 = pd.concat([Feature1,pd.get_dummies(test_df['education'])], axis=1)\nFeature1.drop(['Master or Above'], axis = 1,inplace=True)\n\n\nx_loan_test = Feature1\nx_loan_test = preprocessing.StandardScaler().fit(x_loan_test).transform(x_loan_test)\n\ny_loan_test = test_df['loan_status'].values" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 77, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "[0.67, 0.74, 0.78, 0.74]" | |
}, | |
"execution_count": 77, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# Jaccard\n#jaccard_score(y_train, best_dt_model.predict(x_train), pos_label = \"PAIDOFF\"))\n# KNN\nknn_yhat = best_knn_model.predict(x_loan_test)\njacc1 = round(jaccard_score(y_loan_test, knn_yhat, pos_label = \"PAIDOFF\"),2)\n\n# Decision Tree\ndt_yhat = best_dt_model.predict(x_loan_test)\njacc2 = round(jaccard_score(y_loan_test, dt_yhat, pos_label = \"PAIDOFF\"),2)\n\n# Support Vector Machine\nsvm_yhat = best_svm.predict(x_loan_test)\njacc3 = round(jaccard_score(y_loan_test, svm_yhat, pos_label = \"PAIDOFF\"),2)\n\n# Logistic Regression\nlr_yhat = best_lr_model.predict(x_loan_test)\njacc4 = round(jaccard_score(y_loan_test, lr_yhat, pos_label = \"PAIDOFF\"),2)\n\njss = [jacc1, jacc2, jacc3, jacc4]\njss" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 78, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "[0.66, 0.63, 0.76, 0.66]" | |
}, | |
"execution_count": 78, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# F1_score\n\n# KNN\nknn_yhat = best_knn_model.predict(x_loan_test)\nf1 = round(f1_score(y_loan_test, knn_yhat, average = 'weighted'), 2)\n\n# Decision Tree\ndt_yhat = best_dt_model.predict(x_loan_test)\nf2 = round(f1_score(y_loan_test, dt_yhat, average = 'weighted'), 2)\n\n# Support Vector Machine\nsvm_yhat = best_svm.predict(x_loan_test)\nf3 = round(f1_score(y_loan_test, svm_yhat, average = 'weighted'), 2)\n\n# Logistic Regression\nlr_yhat = best_lr_model.predict(x_loan_test)\nf4 = round(f1_score(y_loan_test, lr_yhat, average = 'weighted'), 2)\n\nf1_list = [f1, f2, f3, f4]\nf1_list" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 79, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "['NA', 'NA', 'NA', 0.57]" | |
}, | |
"execution_count": 79, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# log loss\n\n# Logistic Regression\nlr_prob = best_lr_model.predict_proba(x_loan_test)\nll_list = ['NA','NA','NA', round(log_loss(y_loan_test, lr_prob), 2)]\nll_list" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 80, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th>Algorithm</th>\n <th>Jaccard</th>\n <th>F1-score</th>\n <th>Logloss</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>KNN</th>\n <td>0.67</td>\n <td>0.66</td>\n <td>NA</td>\n </tr>\n <tr>\n <th>Decision Tree</th>\n <td>0.74</td>\n <td>0.63</td>\n <td>NA</td>\n </tr>\n <tr>\n <th>SVM</th>\n <td>0.78</td>\n <td>0.76</td>\n <td>NA</td>\n </tr>\n <tr>\n <th>Logistic Regression</th>\n <td>0.74</td>\n <td>0.66</td>\n <td>0.57</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": "Algorithm Jaccard F1-score Logloss\nKNN 0.67 0.66 NA\nDecision Tree 0.74 0.63 NA\nSVM 0.78 0.76 NA\nLogistic Regression 0.74 0.66 0.57" | |
}, | |
"execution_count": 80, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "columns = ['KNN', 'Decision Tree', 'SVM', 'Logistic Regression']\nindex = ['Jaccard', 'F1-score', 'Logloss']\n\naccuracy_df = pd.DataFrame([jss, f1_list, ll_list], index = index, columns = columns)\naccuracy_df1 = accuracy_df.transpose()\naccuracy_df1.columns.name = 'Algorithm'\naccuracy_df1" | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3.8 Watson NLP", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |
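A reading aid for the evaluation cells in the notebook: the outputs labelled "Accuracy (Jaccard)" and "Accuracy (F1)" are not accuracy, they are separate metrics. The block below is a minimal pure-Python sketch of how the two scores are computed for string labels; it is not the scikit-learn implementation. The helper names `jaccard`, `f1_for_label`, and `weighted_f1` are hypothetical, and the 55/15 class split is an assumption inferred from the reported scores, not read from the data.

```python
from collections import Counter


def jaccard(y_true, y_pred, pos_label):
    """Jaccard index for one label: TP / (TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1_for_label(y_true, y_pred, label):
    """F1 for one label: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights equal to each label's share of y_true."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(f1_for_label(y_true, y_pred, lab) * cnt / n
               for lab, cnt in support.items())


# Assumed 70-row evaluation split: 55 PAIDOFF and 15 COLLECTION (hypothetical counts).
y_true = ["PAIDOFF"] * 55 + ["COLLECTION"] * 15
# A trivial model that always predicts the majority class.
y_pred = ["PAIDOFF"] * 70

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
jac = jaccard(y_true, y_pred, pos_label="PAIDOFF")
f1w = weighted_f1(y_true, y_pred)

print("accuracy   :", round(accuracy, 4))
print("jaccard    :", round(jac, 4))
print("weighted F1:", round(f1w, 4))
```

Under that assumed split, a model that predicts PAIDOFF for every row gets a Jaccard of about 0.786 and a weighted F1 of about 0.691. Both agree with the test-set scores printed for the depth-2 decision tree, which suggests that its Jaccard score there reflects class imbalance more than learned structure. The weighted F1 is the more informative of the two in this setting because it also accounts for the COLLECTION class.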