Data Preprocessing Project-Imbalanced Classes Problem
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Preprocessing Project - Imbalanced Classes Problem\n",
"\n",
"\n",
"Class imbalance is one of the major problems in machine learning. In this data preprocessing project, I discuss the imbalanced classes problem and present a Python implementation of several techniques for dealing with it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"This project is divided into the following sections:\n",
"\n",
"\n",
"\n",
"\n",
"1.\tIntroduction to imbalanced classes problem\n",
"\n",
"2.\tProblems with imbalanced learning\n",
"\n",
"3.\tExample of imbalanced classes\n",
"\n",
"4.\tApproaches to handle imbalanced classes\n",
"\n",
"5.\tPython implementation to illustrate class imbalance problem\n",
"\n",
"6.\tPrecision - Recall Curve\n",
"\n",
"7. Random over-sampling the minority class\n",
"\n",
"8.\tRandom under-sampling the majority class\n",
"\n",
"9.\tApply tree-based algorithms\n",
"\n",
"10.\tRandom under-sampling and over-sampling with imbalanced-learn\n",
"\n",
"11.\tUnder-sampling : Tomek links\n",
"\n",
"12.\tUnder-sampling : Cluster Centroids\n",
"\n",
"13.\tOver-sampling : SMOTE\n",
"\n",
"14.\tConclusion \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to imbalanced classes problem\n",
"\n",
"\n",
"Real-world datasets come with many problems, and **imbalanced classes** is one of them. The problem arises when one class has far more observations than another. The former is called the majority class and the latter the minority class. A model trained on such data tends to be biased towards the majority class and classifies the minority class poorly. As a result, overall accuracy stops being a useful measure of performance. This is a very common problem in machine learning, where many datasets have a disproportionate ratio of observations in each class.\n",
"\n",
"\n",
"The **imbalanced classes problem** is one of the major problems in data science and machine learning, so it is important to deal with it properly when developing a model. If we do not, we may end up with a high accuracy score that is meaningless: a model can reach it simply by predicting the majority class almost every time. Such an accuracy score does not reliably measure model performance. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Problems with imbalanced learning\n",
"\n",
"\n",
"The problem of imbalanced classes is very common. For example, consider a dataset of patients screened for a rare disease. The number of patients who do not have the disease is much larger than the number of patients who have it. A model trained on this data may therefore fail to correctly classify the patients who have the disease. This is where the problem arises.\n",
"\n",
"\n",
"Learning from imbalanced data is referred to as **imbalanced learning**, and a number of modern approaches have been developed for it. \n",
"\n",
"\n",
"Significant problems may arise with imbalanced learning. These are as follows:-\n",
"\n",
"\n",
"1.\tThe class distribution is skewed when the dataset has underrepresented data.\n",
"\n",
"2.\tA high level of accuracy is simply misleading. In the rare disease example, accuracy is high because most patients do not \n",
"   have the disease, not because the model is good. \n",
" \n",
"3.\tThere may be inherent complex characteristics in the dataset. Imbalanced learning from such a dataset requires new \n",
"   approaches, principles, tools and techniques, and even these cannot guarantee an efficient solution to the business problem.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Example of imbalanced classes\n",
"\n",
"\n",
"The problem of imbalanced classes may appear in many areas including the following:-\n",
"\n",
"\n",
"1.\tDisease detection\n",
"\n",
"2.\tFraud detection\n",
"\n",
"3.\tSpam filtering\n",
"\n",
"4.\tEarthquake prediction\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Approaches to handle imbalanced classes\n",
"\n",
"\n",
"In this section, I list various approaches to deal with the imbalanced classes problem. These approaches fall into two broad categories: dataset-level approaches (resampling) and algorithmic approaches (such as tree-based and ensemble methods). The methods are listed below and described in more detail in the following sections.\n",
"\n",
"\n",
"1.\tRandom Undersampling methods\n",
"\n",
"2.\tRandom Oversampling methods\n",
"\n",
"3. Tree-based algorithms\n",
"\n",
"4. Resampling with imbalanced-learn\n",
"\n",
"5. Under-sampling : Tomek links\n",
"\n",
"6. Under-sampling : Cluster Centroids\n",
"\n",
"7. Over-sampling : SMOTE\n",
"\n",
"\n",
"I have discussed these methods in detail in the readme document."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Python implementation to illustrate class imbalance problem\n",
"\n",
"\n",
"\n",
"Now, I will illustrate the class imbalance problem with a Python implementation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Python libraries\n",
"\n",
"I will start off by importing the required Python libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# import Python libraries\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import dataset\n",
"\n",
"\n",
"Now, I will import the dataset with the pandas `read_csv()` function."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/creditcard.csv'\n",
"\n",
"df = pd.read_csv(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset description\n",
"\n",
"\n",
"I have used the **Credit Card Fraud Detection** dataset for this project. I have downloaded this dataset from the Kaggle website. It can be found at the following URL:\n",
"\n",
"\n",
"https://www.kaggle.com/mlg-ulb/creditcardfraud\n",
"\n",
"\n",
"This dataset contains transactions made by European credit card holders in September 2013. It represents transactions that occurred over two days. There are 492 fraudulent transactions out of a total of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.\n",
"\n",
"\n",
"The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used, for example, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable and it takes the value 1 in case of fraud and 0 otherwise. So, our target variable is the `Class` variable.\n",
"\n",
"\n",
"\n",
"Given the class imbalance ratio, it is recommended to measure performance using the `Area Under the Precision-Recall Curve (AUPRC)`. Plain accuracy computed from the confusion matrix is not meaningful for unbalanced classification.\n",
" "
]
},
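{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the AUPRC recommendation above more concrete, here is a minimal sketch of one common step-wise way to compute the area under the precision-recall curve, often called average precision. The function name `average_precision` and the toy labels and scores are hypothetical and are not part of the credit card dataset. The sketch assumes that no two scores are tied."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical helper: step-wise area under the precision-recall curve\n",
"# (assumes labels are 0/1 and that no two scores are tied)\n",
"def average_precision(y_true, scores):\n",
"    # rank observations from the highest to the lowest predicted score\n",
"    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)\n",
"    n_positive = sum(y_true)\n",
"    true_positives = 0\n",
"    area = 0.0\n",
"    for rank, (_, label) in enumerate(ranked, start=1):\n",
"        if label == 1:\n",
"            true_positives += 1\n",
"            precision_at_rank = true_positives / rank\n",
"            # each positive found raises recall by 1 / n_positive\n",
"            area += precision_at_rank / n_positive\n",
"    return area\n",
"\n",
"\n",
"# hypothetical toy data: two frauds (1) among five transactions\n",
"toy_labels = [1, 0, 1, 0, 0]\n",
"toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]\n",
"\n",
"toy_auprc = average_precision(toy_labels, toy_scores)\n",
"\n",
"print('Toy AUPRC : %.3f' % toy_auprc)"
]
},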
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploratory data analysis\n",
"\n",
"\n",
"Now, I will conduct exploratory data analysis to gain an insight into the dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(284807, 31)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check shape of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 284,807 instances and 31 columns in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Time</th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>...</th>\n",
" <th>V21</th>\n",
" <th>V22</th>\n",
" <th>V23</th>\n",
" <th>V24</th>\n",
" <th>V25</th>\n",
" <th>V26</th>\n",
" <th>V27</th>\n",
" <th>V28</th>\n",
" <th>Amount</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>-1.359807</td>\n",
" <td>-0.072781</td>\n",
" <td>2.536347</td>\n",
" <td>1.378155</td>\n",
" <td>-0.338321</td>\n",
" <td>0.462388</td>\n",
" <td>0.239599</td>\n",
" <td>0.098698</td>\n",
" <td>0.363787</td>\n",
" <td>...</td>\n",
" <td>-0.018307</td>\n",
" <td>0.277838</td>\n",
" <td>-0.110474</td>\n",
" <td>0.066928</td>\n",
" <td>0.128539</td>\n",
" <td>-0.189115</td>\n",
" <td>0.133558</td>\n",
" <td>-0.021053</td>\n",
" <td>149.62</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.0</td>\n",
" <td>1.191857</td>\n",
" <td>0.266151</td>\n",
" <td>0.166480</td>\n",
" <td>0.448154</td>\n",
" <td>0.060018</td>\n",
" <td>-0.082361</td>\n",
" <td>-0.078803</td>\n",
" <td>0.085102</td>\n",
" <td>-0.255425</td>\n",
" <td>...</td>\n",
" <td>-0.225775</td>\n",
" <td>-0.638672</td>\n",
" <td>0.101288</td>\n",
" <td>-0.339846</td>\n",
" <td>0.167170</td>\n",
" <td>0.125895</td>\n",
" <td>-0.008983</td>\n",
" <td>0.014724</td>\n",
" <td>2.69</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.0</td>\n",
" <td>-1.358354</td>\n",
" <td>-1.340163</td>\n",
" <td>1.773209</td>\n",
" <td>0.379780</td>\n",
" <td>-0.503198</td>\n",
" <td>1.800499</td>\n",
" <td>0.791461</td>\n",
" <td>0.247676</td>\n",
" <td>-1.514654</td>\n",
" <td>...</td>\n",
" <td>0.247998</td>\n",
" <td>0.771679</td>\n",
" <td>0.909412</td>\n",
" <td>-0.689281</td>\n",
" <td>-0.327642</td>\n",
" <td>-0.139097</td>\n",
" <td>-0.055353</td>\n",
" <td>-0.059752</td>\n",
" <td>378.66</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>-0.966272</td>\n",
" <td>-0.185226</td>\n",
" <td>1.792993</td>\n",
" <td>-0.863291</td>\n",
" <td>-0.010309</td>\n",
" <td>1.247203</td>\n",
" <td>0.237609</td>\n",
" <td>0.377436</td>\n",
" <td>-1.387024</td>\n",
" <td>...</td>\n",
" <td>-0.108300</td>\n",
" <td>0.005274</td>\n",
" <td>-0.190321</td>\n",
" <td>-1.175575</td>\n",
" <td>0.647376</td>\n",
" <td>-0.221929</td>\n",
" <td>0.062723</td>\n",
" <td>0.061458</td>\n",
" <td>123.50</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2.0</td>\n",
" <td>-1.158233</td>\n",
" <td>0.877737</td>\n",
" <td>1.548718</td>\n",
" <td>0.403034</td>\n",
" <td>-0.407193</td>\n",
" <td>0.095921</td>\n",
" <td>0.592941</td>\n",
" <td>-0.270533</td>\n",
" <td>0.817739</td>\n",
" <td>...</td>\n",
" <td>-0.009431</td>\n",
" <td>0.798278</td>\n",
" <td>-0.137458</td>\n",
" <td>0.141267</td>\n",
" <td>-0.206010</td>\n",
" <td>0.502292</td>\n",
" <td>0.219422</td>\n",
" <td>0.215153</td>\n",
" <td>69.99</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 31 columns</p>\n",
"</div>"
],
"text/plain": [
" Time V1 V2 V3 V4 V5 V6 V7 \\\n",
"0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n",
"1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 \n",
"2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 \n",
"3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 \n",
"4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 \n",
"\n",
" V8 V9 ... V21 V22 V23 V24 \\\n",
"0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 \n",
"1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 \n",
"2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 \n",
"3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 \n",
"4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 \n",
"\n",
" V25 V26 V27 V28 Amount Class \n",
"0 0.128539 -0.189115 0.133558 -0.021053 149.62 0 \n",
"1 0.167170 0.125895 -0.008983 0.014724 2.69 0 \n",
"2 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0 \n",
"3 0.647376 -0.221929 0.062723 0.061458 123.50 0 \n",
"4 -0.206010 0.502292 0.219422 0.215153 69.99 0 \n",
"\n",
"[5 rows x 31 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview of the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `df.head()` method gives a preview of the dataset. We can see that there is a `Class` column in the dataset which is our target variable.\n",
"\n",
"\n",
"I will check the distribution of the `Class` column with the `value_counts()` method as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 284315\n",
"1 492\n",
"Name: Class, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the distribution of Class column\n",
"\n",
"df['Class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, we have 492 fraudulent transactions out of a total of 284,807 transactions in the dataset. The `Class` column takes the value `1` for \n",
"fraudulent transactions and `0` for non-fraudulent transactions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will find the percentage of labels 0 and 1 within the `Class` column."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.998273\n",
"1 0.001727\n",
"Name: Class, dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# proportion of each label within the Class column\n",
"# (normalize=True replaces the np.float alias, which was removed in newer NumPy releases)\n",
"\n",
"df['Class'].value_counts(normalize=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `Class` column is highly imbalanced. About 99.83% of the labels are `0` and about 0.17% are `1`. \n",
"\n",
"Now, I will plot the bar plot to confirm this."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x6b5b677400>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAC15JREFUeJzt3V+Infldx/H3ZxOjF60tmFFq/jSBpmgUYWWIhV640orJCslNkQSKWpbmxijSIkaUVeONthcFIf4JWqsFN8Ze6FAjuahbBDU1s7QuJiE6xGqGiDttlwUpmsZ+vZixHk5Ocp5JTnKSb94vGDjP8/xyzpcwefPMc84zSVUhSerlmXkPIEmaPeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJamhrfN64e3bt9eePXvm9fKS9ER65ZVXvlRVC9PWzS3ue/bsYXl5eV4vL0lPpCT/OmSdl2UkqSHjLkkNGXdJasi4S1JDxl2SGpoa9yQfT/Jakn+8y/Ek+c0kK0leTfL9sx9TkrQZQ87cPwEcvMfxQ8C+ja/jwG8/+FiSpAcxNe5V9dfAV+6x5AjwR7XuIvDWJG+b1YCSpM2bxU1MO4AbI9urG/v+fXxhkuOsn92ze/fuGbz0w7fn5F/Me4RWvvjrPzrvEaSnwizeUM2EfRP/1+2qOlNVi1W1uLAw9e5ZSdJ9mkXcV4FdI9s7gZszeF5J0n2aRdyXgB/f+NTMu4A3quqOSzKSpEdn6jX3JC8BzwHbk6wCvwx8E0BV/Q5wHngeWAG+CnzgYQ0rSRpmatyr6tiU4wX81MwmkiQ9MO9QlaSGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLU0KC4JzmY5FqSlSQnJxzfneTlJJ9P8mqS52c/qiRpqKlxT7IFOA0cAvYDx5LsH1v2S8C5qnoWOAr81qwHlSQNN+TM/QCwUlXXq+oWcBY4MramgG/dePwW4ObsRpQkbdaQuO8Aboxsr27sG/UrwPuTrALngZ+e9ERJjidZTrK8trZ2H+NKkoYYEvdM2Fdj28eAT1TVTuB54JNJ7njuqjpTVYtVtbiwsLD5aSVJgwyJ+yqwa2R7J3dednkBOAdQVX8HfAuwfRYDSpI2b0jcLwH7kuxNso31N0yXxtb8G/AegCTfzXrcve4iSXMyNe5VdRs4AVwArrL+qZjLSU4lObyx7MPAB5P8A/AS8JNVNX7pRpL0iGwdsqiqzrP+RunovhdHHl8B3j3b0SRJ98s7VCWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJamhQXFPcjDJtSQrSU7eZc2PJbmS5HKSP57tmJKkzdg6bUGSLcBp4IeBVeBSkqWqujKyZh/wC8C7q+r1JN/+sAaWJE035Mz9ALBSVder6hZwFjgytuaDwOmqeh2gql6b7ZiSpM0YEvcdwI2R7dWNfaPeCbwzyd8kuZjk4KQnSnI8yXKS5bW1tfubWJI01ZC4Z8K+GtveCuwDngOOAb+X5K13/KGqM1W1WFWLCwsLm51VkjTQkLivArtGtncCNyes+fOq+lpV/QtwjfXYS5LmYEjcLwH7kuxNsg04CiyNrfkz4IcAkmxn/TLN9VkOKkkabmrcq+o2cAK4AFwFzlXV5SSnkhzeWHYB+HKSK8DLwM9V1Zcf1t
CSpHub+lFIgKo6D5wf2/fiyOMCPrTxJUmaM+9QlaSGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLU0KC4JzmY5FqSlSQn77HufUkqyeLsRpQkbdbUuCfZApwGDgH7gWNJ9k9Y92bgZ4DPzXpISdLmDDlzPwCsVNX1qroFnAWOTFj3a8BHgP+a4XySpPswJO47gBsj26sb+74hybPArqr69L2eKMnxJMtJltfW1jY9rCRpmCFxz4R99Y2DyTPAx4APT3uiqjpTVYtVtbiwsDB8SknSpgyJ+yqwa2R7J3BzZPvNwPcCn03yReBdwJJvqkrS/AyJ+yVgX5K9SbYBR4Gl/ztYVW9U1faq2lNVe4CLwOGqWn4oE0uSppoa96q6DZwALgBXgXNVdTnJqSSHH/aAkqTN2zpkUVWdB86P7XvxLmufe/CxJEkPwjtUJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqaFBcU9yMMm1JCtJTk44/qEkV5K8muQzSd4++1ElSUNNjXuSLcBp4BCwHziWZP/Yss8Di1X1fcCngI/MelBJ0nBDztwPACtVdb2qbgFngSOjC6rq5ar66sbmRWDnbMeUJG3GkLjvAG6MbK9u7LubF4C/nHQgyfEky0mW19bWhk8pSdqUIXHPhH01cWHyfmAR+Oik41V1pqoWq2pxYWFh+JSSpE3ZOmDNKrBrZHsncHN8UZL3Ar8I/GBV/fdsxpMk3Y8hZ+6XgH1J9ibZBhwFlkYXJHkW+F3gcFW9NvsxJUmbMTXuVXUbOAFcAK4C56rqcpJTSQ5vLPso8CbgT5N8IcnSXZ5OkvQIDLksQ1WdB86P7Xtx5PF7ZzyXJOkBeIeqJDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGBsU9ycEk15KsJDk54fg3J/mTjeOfS7Jn1oNKkoabGvckW4DTwCFgP3Asyf6xZS8Ar1fVO4CPAb8x60ElScMNOXM/AKxU1fWqugWcBY6MrTkC/OHG408B70mS2Y0pSdqMrQPW7ABujGyvAj9wtzVVdTvJG8C3AV8aXZTkOHB8Y/M/k1y7n6E10XbG/r4fR/FnuqfRE/G9+QR5+5BFQ+I+6Qy87mMNVXUGODPgNbVJSZaranHec0jj/N6cjyGXZVaBXSPbO4Gbd1uTZCvwFuArsxhQkrR5Q+J+CdiXZG+SbcBRYGlszRLwExuP3wf8VVXdceYuSXo0pl6W2biGfgK4AGwBPl5Vl5OcAparagn4feCTSVZYP2M/+jCH1kRe7tLjyu/NOYgn2JLUj3eoSlJDxl2SGjLuktTQkM+56zGT5LtYvyt4B+v3E9wElqrq6lwHk/TY8Mz9CZPk51n/FRAB/p71j6oGeGnSL3WT9HTy0zJPmCT/BHxPVX1tbP824HJV7ZvPZNK9JflAVf3BvOd4Wnjm/uT5OvCdE/a/beOY9Lj61XkP8DTxmvuT52eBzyT5Z/7/F7rtBt4BnJjbVBKQ5NW7HQK+41HO8rTzsswTKMkzrP8q5h2s/6NZBS5V1f/MdTA99ZL8B/AjwOvjh4
C/rapJP3XqIfDM/QlUVV8HLs57DmmCTwNvqqovjB9I8tlHP87TyzN3SWrIN1QlqSHjLkkNGXdJasi4S1JD/wvQ9V7if0uClQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# view the distribution of proportions within the Class column\n",
"\n",
"\n",
"df['Class'].value_counts(normalize=True).plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above bar plot confirms our finding that the `Class` variable is highly imbalanced. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Misleading accuracy for imbalanced classes\n",
"\n",
"\n",
"Now, I will demonstrate that accuracy is misleading for imbalanced classes. Most of the machine learning algorithms are designed to maximize the overall accuracy by default. But this maximum accuracy is misleading. We can confirm this with the following analysis.\n",
"\n",
"\n",
"I will fit a very simple Logistic Regression model using the default settings. I will train the classifier on the imbalanced dataset."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# declare feature vector and target variable\n",
"\n",
"X = df.drop(['Class'], axis=1)\n",
"y = df['Class']"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# import Logistic Regression classifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"\n",
"# instantiate the Logistic Regression classifier\n",
"logreg = LogisticRegression()\n",
"\n",
"\n",
"# fit the classifier to the imbalanced data\n",
"clf = logreg.fit(X, y)\n",
"\n",
"\n",
"# predict on the training data\n",
"y_pred = clf.predict(X)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I have trained the model. I will check its accuracy."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 99.90%\n"
]
}
],
"source": [
"# import the accuracy metric\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"\n",
"# compute the accuracy (true labels first, then predictions)\n",
"accuracy = accuracy_score(y, y_pred)\n",
"\n",
"print(\"Accuracy : %.2f%%\" % (accuracy * 100.0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Accuracy paradox\n",
"\n",
"\n",
"Thus, our Logistic Regression model for credit card fraud detection has an accuracy of 99.90%. It means that 99.90% of the transactions were assigned their correct label.\n",
"\n",
"\n",
"It does not mean that our model performance is excellent. I have previously shown that our dataset has 99.83% genuine transactions and 0.17% fraudulent transactions. A classifier that simply predicted every transaction as genuine \n",
"would already have an accuracy of 99.83%, because it would correctly classify all the genuine transactions while missing every fraud.\n",
"\n",
"\n",
"Thus, this model is 99.90% accurate, yet that figure says almost nothing about how well it classifies fraudulent transactions. So, we need other ways to measure the model performance. One such measure is the confusion matrix described below."
]
},
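{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy paradox can be checked with simple arithmetic. The sketch below uses only the class counts reported by `value_counts()` and a hypothetical baseline that labels every transaction as genuine. The variable names are illustrative and the baseline is not a model fitted in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# class counts taken from the value_counts() output\n",
"n_genuine = 284315\n",
"n_fraud = 492\n",
"n_total = n_genuine + n_fraud\n",
"\n",
"# a trivial baseline that labels every transaction as genuine (class 0)\n",
"# is right on every genuine transaction and wrong on every fraud\n",
"baseline_accuracy = n_genuine / n_total\n",
"baseline_fraud_recall = 0 / n_fraud\n",
"\n",
"print('Baseline accuracy     : %.2f%%' % (baseline_accuracy * 100.0))\n",
"print('Baseline fraud recall : %.2f%%' % (baseline_fraud_recall * 100.0))"
]
},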
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confusion matrix\n",
"\n",
"\n",
"A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
"\n",
"\n",
"Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-\n",
"\n",
"\n",
"**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
"\n",
"\n",
"**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
"\n",
"\n",
"**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
"\n",
"\n",
"\n",
"**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called **Type II error.**\n",
"\n",
"\n",
"\n",
"These four outcomes are summarized in a confusion matrix given below.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix:\n",
" [[284240 75]\n",
" [ 203 289]]\n"
]
}
],
"source": [
"# import the metric\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"\n",
"# print the confusion matrix\n",
"cnf_matrix = confusion_matrix(y, y_pred)\n",
"\n",
"\n",
"print('Confusion matrix:\\n', cnf_matrix)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation of confusion matrix\n",
"\n",
"\n",
"Now, I will interpret the confusion matrix. In scikit-learn, the rows of the confusion matrix correspond to the actual classes and the columns to the predicted classes. I treat fraud (`Class = 1`) as the positive class.\n",
"\n",
"\n",
"- Out of the 284315 transactions which were actually genuine, the classifier predicted 284240 correctly as genuine. It wrongly predicted the remaining 75 genuine transactions as fraudulent. So, we have `284240 True Negatives(TN)` and `75 False Positives(FP)`.\n",
"\n",
"\n",
"- Out of the 492 transactions which were actually fraudulent, the classifier predicted 289 correctly as fraudulent. It wrongly predicted the remaining 203 fraudulent transactions as genuine. So, we have `289 True Positives(TP)` and `203 False Negatives(FN)`.\n",
"\n",
"\n",
"\n",
"- So, out of all the 284807 transactions, the classifier correctly predicted 284529 of them. Thus, we get the accuracy of\n",
"`(284240+289)/(284240+289+75+203) = 99.90%.`\n",
"\n",
"\n",
"\n",
"- But this is not the true picture. The purpose of the model is to find the fraudulent transactions, and the 289 correctly detected frauds make up only `(289/284807)=0.10%` of all transactions. The accuracy is driven almost entirely by the 284240 genuine transactions that were classified correctly.\n",
"\n",
"\n",
"\n",
"- Moreover, we have `203+289=492` fraudulent transactions. The classifier correctly identifies 289 of them while it misses 203. So, the fraction of frauds detected (the recall for the fraud class) is only `(289/492)=58.74%.`\n",
"\n",
"\n",
"So, we can conclude that the accuracy of 99.90% is misleading because we have imbalanced classes. We need a more subtle way to evaluate the performance of the model.\n",
"\n",
"\n",
"There is another tool called the `Classification Report` which helps to evaluate model performance."
]
},
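{
"cell_type": "markdown",
"metadata": {},
"source": [
"The figures in the interpretation can be reproduced directly from the four cells of the confusion matrix. The sketch below hard-codes the matrix printed earlier and treats fraud (class `1`) as the positive class, which is an assumption consistent with the dataset description. The variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# confusion matrix printed earlier: rows are actual classes, columns are\n",
"# predicted classes, both ordered [0 (genuine), 1 (fraud)]\n",
"cnf = [[284240, 75],\n",
"       [203, 289]]\n",
"\n",
"# treat fraud (class 1) as the positive class\n",
"tn, fp = cnf[0]\n",
"fn, tp = cnf[1]\n",
"\n",
"accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
"precision = tp / (tp + fp)\n",
"recall = tp / (tp + fn)\n",
"f1 = 2 * precision * recall / (precision + recall)\n",
"\n",
"print('Accuracy  : %.4f' % accuracy)\n",
"print('Precision : %.4f' % precision)\n",
"print('Recall    : %.4f' % recall)\n",
"print('F1 score  : %.4f' % f1)"
]
},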
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification report\n",
"\n",
"\n",
"\n",
"A **classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1** and **support** scores for the model. These terms are described in the following sections.\n",
"\n",
"\n",
"\n",
"We can print a classification report as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification Report:\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 284315\n",
" 1 0.79 0.59 0.68 492\n",
"\n",
" micro avg 1.00 1.00 1.00 284807\n",
" macro avg 0.90 0.79 0.84 284807\n",
"weighted avg 1.00 1.00 1.00 284807\n",
"\n"
]
}
],
"source": [
"# import the metric\n",
"from sklearn.metrics import classification_report\n",
"\n",
"\n",
"# print classification report\n",
"print(\"Classification Report:\\n\\n\", classification_report(y, y_pred))"
]
},
{
"attachments": {
"Precision.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAOcAAAA4CAYAAAAGnO/aAAAFBElEQVR42uydj5G7KBTH2ZttgCvBLcEtwZTglcCVkJRgSnBLyJZgStASYgmxhNw487h5ww8QMUaSfD8zzmSUh2D4Cjz585cAACTJJx4BSJBCCNFE2p6FEDv6PcaR0THSCSEGI3xG50e7Hzx6AKbFeRVCKCGEZOfH3zc6pEPQF0t82sbFnu53ssQLAGAoOsRMoUkS2VxxaoGOYSo8fgDcjALJI8Q5UkeKk9fKWQoPAQ4hkCpdpN0QKa7B6IdCnABY6BeKWm54fwDelpAmaoxNSWFOeMQApCPOnLy8TUre2g/81+AJxTm37Gqbo0OYQ4rfOTEIAbwbh2dJKBxCAECcAACIEwCIEwAAcQIAcQKQLNLxGwCwEQ0dVzag4MrOVx67C7O5TYRPCv4hV88E8E1OTXli6pi2lsZFfkfG0VI83xhfCVJtOviGOxXsrZVS06KIHNrF0fYFigFIlalCrhKdmLqnAcyxlBQHAE8rzuQmpgLwasR6a5ObmAoAxPkncJwAsAKfC/pmI79MnIrO59Ts/WD9t4qm6/AZAQWFH1gN7PIC5xTXwML3LGxO99CeZnM6UWVplneUhn/oXOOx1+nbG60GSfnqjbRWdC1nyzTq55JRHAfLMo0ALOpzSvrk4PLWnsi2ZNfN+BR5fHNLvKaQtPMpt5yvHPc2hVk40tkE2As2GTcPPM/jUpbrrWMhKgCCxVkZR02FTnlsKyZOXtOWrEDfHHGUhpMp83iFW8syiJVFXK6XTGYRp80+o/tUnvxeLX1vHdfeYzMH82P63AO+gRcSZ+xnkspR02lqT2GRRoGuPd8dWzqmxHWleGy1/D7Avp4o3PoFUjvikp5nBMDd+5whdBP9VV/tK42wZ0uY0FFABxKOong61l89BtgXE46v3ghnMjzpyxlsy8cWy5RIJhqXmMywS/ghAZUkoIK9HHYB4gltDr5SsxFrSyXAFrNShpXC+kQz1pj/CiG+hBB/0+8s0CkzPDCtAGwqzt+JZiCvhabCysA+sCmiH/qEUs5Ir6sPnRvh1gIOIYhzdY4kkNLTxytYk1R4wqqA+5UOEZ8Da7vjRBpKI9xafFFzM/bAYJEXEKe8U39PehwoByrUylJjFkyUHTVBbbtO5TMKnK35ug+c9tZTGkrHd86SrvcLancAvB1/23xO7SXdBcSlmNNFF2otxM7RdN0b/bXBUQMVTJw9hetZU9K8t55zeqDPLQdjNJJ5r5zlX7K07yzpDR0hZMZ1YGI2n9EORREAEEq1oH9bsReVuRpBy1YwaNj1OrF5tDH5b4zKak7e4RMAswpna+mzu3aXlhS2jdgjUzIxpDZHOPfspm3m++q4Hpp3hWIHQvvprprsNnNYpAhcpeJkGfqZAqGb714X2Kead5AgpwWFNXZ3aT22+nKH9F8eLE7BZkYtyfv/Q1KxNCZwscWnFz59cClb9OG6BfcdWDMa4gSribN/whdDDMoiznfJO0iUtXaX9k2zi0nj2vmtVsg7nEIgOXEWd/bWri1OOaNfG5V3bJ4LHo1NfDkbjHFOOO2NkWZ5h7zrZWtSzzt4g5qzeGAa18xvMWM1i6jFzuEQAs9KPTFaRwSOYoolxS1JAEDNSag1a04AIM71xYVmLQCvBMQJAMQJXoR32116aX6xGzdYlTxgd2nlsbtYbFTifU7bfMzLjJ2xffYYBQTemhuatQCkCQaQAwBQcwLwVvwXAAD//wtFB2TtUkP9AAAAAElFTkSuQmCC"
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision\n",
"\n",
"\n",
"Precision can be defined as the percentage of correctly predicted positive outcomes out of all the predicted positive outcomes.\n",
"It can be given as the ratio of true positives (TP) to the sum of true and false positives (TP + FP). \n",
"\n",
"\n",
"Mathematically, precision is defined as\n",
"\n",
"\n",
"![Precision.png](attachment:Precision.png)\n",
"\n",
"\n",
"\n",
"Precision is more concerned with the positive class than the negative class.\n"
]
},
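{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked example, take fraud (class `1`) as the positive class. The confusion matrix printed earlier then gives TP = 289 and FP = 75, so\n",
"\n",
"\n",
"$$\\text{Precision} = \\frac{TP}{TP + FP} = \\frac{289}{289 + 75} = \\frac{289}{364} \\approx 0.79$$\n",
"\n",
"\n",
"This agrees with the precision of 0.79 shown for class `1` in the classification report."
]
},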
{
"attachments": {
"Recall.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAMoAAAA4CAYAAAC8P2e6AAAEsElEQVR42uydi23jOBCGZw/XAFvQlcArQVuCWnBKWJdAl2CX4JQQlyCVYJUQl5BDgBEwmONDIuWYsv8PEHZtkZKp8Ndwhq+/CACQ5G88ApDgg4jazLy/iehCRDsi6oio4eObgYhuKn3D6Q9ENOLRg60JpeeKLjFE9MWHUd9/C+OT/w3l8fF93vH5HR492BK9EoIkVul3XOmX5Jk4c5qulocAHwWkuHmaSHN4L7jnlPcPhAK2wlAgMFOQ9xsLoYCtMD5AZGvcG4BqmONv5OSBQw8glESels+7mgqKfhTwKHxCaNg/mfpfAIBF2RJw5gGAUACAUACAUACAUACAUMCLYAL/XzsPAJtjx8PuP3gY/RTq/RTf20i+q8hz5e820fv+S312XNDYBBvDnUGnysfiWC7PVJZf6g/XBc4BsMishjqGtja55hzp4DpvsfML1EWqB/VY2+SaAC5SDgehgHs78xdR2QCAUAJMfkuDxwgglDCNsiwAPC0lw+x3bFX2kTQt+zDS+kwRM43lOdJyjvboSduoIILlyNwef05QkzPfcfz7mOg0mpaskXF1w6t6OE/aL08MXq/kMUXc9H3PfF048+BhQnHquAYqq7YOofBxx+caYSFCM9p6FpsWlPNYmVAEbk2hNOK55BxXVKvXsSjyDR4Sy1GJgZRV+BJL0Uxp24BQetWU+wosYxMSGywKeJiPsheW5S1gNSjRIWlUWl9Q4F/1+SJ60Q1bLvvE44Yg4jo45FoUEmN8cvLmpp2aP0cxTsjCooB7Uzp6OPQmX7Ky4JK0lptiAy9AcFph7SgA7i4UCvgW75Fz0jLMSWuU7xMKL5Ny+u8BnHkIZTGDqvCysh/YUnQRcU1pT8pXiVV6GxixbAMiXJuRfaTc4x9UuecSyhzn+OSppJ2oUHv+vPNU4lbkHzggsPOk1cK4cF49Aajj65iMZiIASebMRxn5+O3J33Go9iKaUYMSxR/li9wCUYRWCGXkdKNaFd3w9ay6z4Hv5dT1nYiKjULAln97O6OMAIANUOITyRegnmHYi1mJ07lzhXOMcsot/V09I7OP3EvP3OxrWlEfpCvK2RPsCO14NVlxX4dvaserVkzrramp2gTKqpvpx0iHtOOKn1rX2BRuxwceVEH6GW9bClSMdmEeigwTqsWypHCR/jQrrmMT11gt6gV+Rii50xhy+5fkblel0cNHhMMPEaszCP/1uEbUC9RDboUfMyv6TQm1VOg/TWorvT0/U7vEakIodXMrWOlmjRELW9nxSgcgUlb4TVhNOOsv4uivve1CJwIIa/y+e5fVzLQMzvM5FAWDjwKiGNEv9rZRazKXRU0w7Lj1ujiPSAyPmDhV/Ls/lA/UpIbBJ5pgPb8c3jHAFk0vncf94O+7Z1kNR69yml6xJhiaXmBVmhkjA+aOHsgNdpQuKjKrCQahgBJSo6lpxohqWkEspdG5ZBQMQgHPQKlPld0RCeCjbMFHKQle+OhDU9xhUQD4fxPMQCjPQ8mOV+ZJyr0kj50xIngoCDWDinCBHa+u4vtYPjkfJZa+pqaX/t2p3b18+ZfOMelR1cBP+1BPAZpe4J6MeAQAvBCwKADM4L8AAAD//08Owr0ITzbtAAAAAElFTkSuQmCC"
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recall\n",
"\n",
"\n",
"Recall can be defined as the percentage of correctly predicted positive outcomes out of all the actual positive outcomes.\n",
"It can be given as the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN). **Recall** is also called **Sensitivity**.\n",
"\n",
"\n",
"Mathematically, recall can be given as\n",
"\n",
"\n",
"\n",
"![Recall.png](attachment:Recall.png)\n",
"\n"
]
},
{
"attachments": {
"f1-score.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAU8AAAA8CAYAAADmDDgpAAAJmElEQVR42uydjbGbOhOGlUwa4CuBlEBKwCVwSsAl4BI4JeASTAlQgl2CKcGUwDe5s5q7o6tfjAzY7zPD5AQECAm93tXP8lMAAAAIBuIJQBi1EGISQhQzzy/o/BpF+dI66wzl3lmOQTwBWJBM+ffV54NwTrTpOFiOAQAWFs+/Fkoy8/yEzod4vh6bdRlsef5CeQIQxI22uYywdN4DuO0AAADxBACA1wC3HaxFSSPPGbnBR9onqF8wI/e2V87r2PED+zuntNwlzuiaI9uXUJpRue7ftBXtl9sghGhZmprd61vjfteKe/6XlK7Rs2c2nS/TV5o8f1N+1LzI8vvSlN9Zyf+SrFkP/B1K2fVkOd3QvMAncBFC3IUQjbI/p078UnNOxo5JwZpo4w3roRmYqWh/oqTV3SvR7MsMgws15Vn3fLnH+fLYXZNn2/6JhEydOiWnRKUR627Neqg05Z05ppEtOmAEwNrIeZO60euGjuWGl/2qiEXBrDdbY+BiLdM2mnRXauA+DW0y3CvV5F93fkr3qi3l9NCI4UTPkxjyWUWuv7Xq4a4RX/nOXF8hnujzBFth1OxrWYPUMShpW2aVCHJbheG6hZK29biHK/+NpjEPnteo6FxTns90vDI8z2g4LwnsSulmTKNaox4GwzOPr5oGhj5PsGV6Jp7HAGErmBiY3M1ESdtr0v0JyOuJxLOka90of2dP8cwdzzQo6WLQMCv3MFM8X1UPB+U6KesDFhBPAMKtJ57+FOHawmIZDmxAKGfCcbBYhtxt9yFmH+aZ8tsuXG+x6qFkP1Y9DRZVkX9gIJ5gV6I5zugCSCKkdYlazyynhIS0Jovua6F8jBHL+2iw8J/piolVDw0b6R/XeDnR5wm2jLQgQi2h1sPFTT3TJp7WXq0RgzOJZhGQ58zi4oqIU49iEKseCrI4jx7CWUI8wSe65jlzJ0P4pkZVOCwXmVZY0lae4lcYnqH3tIxc+SiUdHsgVj3IH5KbR7dGCvEE706tsRikdRE66Xkgd64wWB4164cb2AR93ZzOLEC8G0OjP3vm+cgWDqj5kINmw47qNFY99AarsmTlk9IWrbx+oM2CDYhmRSOqBbNC5bQd3QqjlFkUcmTbtFqn1DQi3Qh4yqbLjBrrSb33yO57o7mFJ3qGUXHfv9mz5kwce3JXz5p8+KwwUq8l86Mro3aGBW9jzXqQXknJViFJS7Rlq69kftYuq/+88LVjUi/YFhU18I4mGF+eCOi79Ls0oXrA3vj5hHt1UvqlwHZp6Jf1D7lRv+mX92JwNQEAC5NYlsuBbVKSSJpEdYo5IgnLE4B/GyJe9H1xdUx/mSxrgSGeACz4oncohl0xGQJK8OPTSj/EHbv/Hf3n4J3p8ILvjjuJU7Yx8QRg1/gsz0xZn1hOUwvUQaMlLVuVkQVOVV3OUgk622umGPhO/eDBeROaxlWxfKnBa3NlWorp/msjA9Xq5krKvuseTQGAeOSRB4tswWR1E6h1AVYbJW1ocFl5v0lZMeIT4DWhvsM9WebNk98gBwB4EHuwyBZMtlb+bwpc+mBBU+cGl601glIoS8NMI9Qh0bvvTJTnbM8uO7MFnwUALGwZ3iNe/2EIJisUi9QWXfzKRo4bh8iYxKN29BHarpu8KHr3ElxnjLLPDZYLwEfTOUbaMzqeP2nZTmxgqjSI7BRg2bms3btBPG0iP7FVVuq2h2+hyE8VJDO8gylwxsWEDdsbbd1cy7C2uNWFxSKsLZlRLcxGcWnVRu47Ojx5iucUKJ57H51uLBa+bx9pCVsCAD8Sz4EFk3imLLq2uvE06j3lwMxlhuX58BTPx0zLc49Uhj7OFK84AHHIHf2ALvH04eK4N083OYKmcivJtbqmCRTPxuM5tzhgVFoGhzB3F4BI1AGu8lzxnCyu5EMjpo1FbIXHZ09rgwi5njVlg1um+2/NrS
0do+oYcQcgEo1nR+mz4nkxuJq6eZ66vrdUEcPSMc+ztIhq4hCjh+H+W7PiMsprZ9iusDwBCMc3GPKVVqGcPATwMHPFik8wWZOFN7BPvd40ghaywoh/AnZgwW6F4bqCXXvc4GcSHh6DQ6edfd4BgN3guwoF4erAu1A/ufqq2Mm0tU+v485QT53l2D/4BEPG+mfwiWTKv68+H8TnZPGmDy5P2xQY5EKi+T/24aURZQ0+rGEVT3RnyPNaFOWmuc08ZnW/a2a+ph7mb80GfWpMpAZgl9w/8Jlt3SvBXS85m1CNiDsAfJaQQDw9jpnc9h59nAAAYOYXigBEhgeZvgkhjqxLJ6H9J8P32eXxA/s710ybUwNjCxZAW+2rz9jUNbkNSt9kze71rRk44JYID4bdUt74tLdvy7fMfb/PnrHy+9KU33lDfatr1ht/51J2PVmuNzRHsEcu1J+mrmbKLcFGeOzUmrlRrsDUghqaOsfVtLgi0ezLDC6bLWh37nG+mBGkO2MRfdRutGKh+K5Luu1r1lulqZ/MMe1s0T5PAJbGtnLLFi9A/bonD0ztWoLLxdoW/PmqCRBjaji2oN25x/lzg3TL8ImJIZ/VhsRzzXq7GxaGNJb4tbPE8yfaNHgxuilvLWtgOgYlbcusDGH5blTLrlkp97Ldw5V/XUi/wfMaFZ1ryvOZjleG5zFNGUw2WNdr1NtgKKNx6Tm36PMEW6Bn4nkMELaCuXUm9zFR0uoGQv8E5PVE4lnSteSy4LOneOaOZxqUdEtz93DxbdZnyPLrNertoFwnZX3GAuIJ3pVkZvpThGsLi2U4sAGhnAnBwWMxiW/fZKwYq789hPPHi+o5Vr2V7Metp8GiaukfJIgn2JJojjO6AJIIaV2ixqfyJSSkNVmkXwvlY3zzrptY9dawkf6oZYg+T7AFpEUQOt2m9XBxU8+0iae1V2sa95lEswjIsy1It3jzZZ2x6q0gi/PoIZwlxBO8g2ueM5c4hG9qJIXDEhFsrbkpbeUpfoXhGXpPS8eVj0JJ947Eqjf5w3Pz6AZJIZ5gb+gCW0trIXQS80DuWWEJbH1iaY/sfqq1lwWId2NoxGfPPB/ZwgE1H3LQbHjjdyBWvfUGq7Jk5SkDpj9dvj/QlsELRbOiEdKCWaFy2o5uhRH/MoAc2Tat1ik1jUI3Aq4LYs2tIfXeI7vvzTNot5xIn7G8t4q4hqwwUq8l86Mro3aGBc95dsBozXoTLEj6yPbfqFxqtlLrtELZAjBbPCcUw+ZBHcFtBwDMdKkBxBMAEMhvFAHEEwAAANgtJfuQlgxugSg1AAAAANx2AAAAEE8AAIjF/wMAAP//asJejAc3Jj0AAAAASUVORK5CYII="
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"### f1-score\n",
"\n",
"\n",
"**f1-score** is the weighted harmonic mean of precision and recall. The best possible **f1-score** would be 1.0 and the worst \n",
"would be 0.0. **f1-score** is the harmonic mean of precision and recall. So, **f1-score** is always lower than accuracy measures as they embed precision and recall into their computation. The weighted average of `f1-score` should be used to \n",
"compare classifier models, not global accuracy.\n",
"\n",
"\n",
"Mathematically, f1-score can be given as\n",
"\n",
"\n",
"![f1-score.png](attachment:f1-score.png)"
]
},
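{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick worked example with made-up counts (these numbers are hypothetical and do not come from the dataset used here), suppose a classifier produces $TP = 80$, $FP = 20$ and $FN = 120$. Then\n",
"\n",
"$$\\text{Precision} = \\frac{TP}{TP + FP} = \\frac{80}{100} = 0.80, \\qquad \\text{Recall} = \\frac{TP}{TP + FN} = \\frac{80}{200} = 0.40$$\n",
"\n",
"$$\\text{f1-score} = \\frac{2 \\times \\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}} = \\frac{2 \\times 0.80 \\times 0.40}{0.80 + 0.40} \\approx 0.53$$\n",
"\n",
"The f1-score (0.53) is below the simple average of the two values (0.60), because the harmonic mean is pulled towards the lower one."
]
},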
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Support\n",
"\n",
"\n",
"**Support** is the actual number of occurrences of the class in our dataset. It classifies `284315 transactions as genuine` and `492 transactions as fraudulent`."
]
},
{
"attachments": {
"fpr-formula.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAANkAAABPCAIAAADp4/d/AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAPJSURBVHhe7d0xUuMwGIZhs2dJKBhOEE6Q0GxFSxfKpNmOkm4bKElHS7XNwgnwCRgKkrtkjaPIkoMde21Fv5T3aeIxngQ8H5L1W4pP1ut1AgjwQ70CvpFFSEEWIQVZhBRkEVKQRUhBFiEFWYQUZBFSkEVIQRYhBVn0YfVwcVLl4mG1Oej1Ru0p0QfEhixKk34s89fV53v+uiOdD+MMJFmUZnQ2VFt1skDevKrtWJBFz6Yva9vbbKB+pBXHLO9Hal+SLP5EFkayGJTB7KlI4/tnXP00WQzM4PRcbUWHLAamckgTPrLo2WKiSjVbtUOS15vhPFXbo6vLnSvLoJHFABR5nSzUrmxAc7s7yAkbWRSmWUlndL98HKvtaJBFz8o1nW9KOrYshg2OChFZDICV1yhjmCOLkIIsQgqyCCnIIqQgi5CC73aCFLSLkIIsQgqyCCn6yGLPK4mqjtwzhQWhc90utl9JVDNBL5+vQiBj5TqL/a8kWkxIY5x6zmK/K4nMd3uZqp3fH4rweR27tFlJNH4s0uh00dHm0hQmdWoc85rFViuJjOvI89OqaVN1w6iN+g4+O0BtwXCY0+I5i41XEq0erouFHo2uQRGanrPY70oi492MA++fop1NeuTU0KALcwhSth1+1B3z5fsBTVk+v94x9VEwqFPjmOM+OsCVROrEwKBOjWM9Z7Fc0+m2kki9W9FSpvPrWL/wDT7GLlZem6wkGszedD0nnf+uuwLtOo6GR57H0U2Nf+ri4uKOpjFOgWQxGf8qOurnv4QxRqFkMRlcXhVXjZX9dNadq86/SnxftxCNYLJohpE70lEKJ4tZm3dbXDQSxgiprgtVGtXejVlEFkfF+aqPK9lULOw/oFx1y+h3O8SthBoBtYsCBfHMgeXH9vbpF8HTP8liF0E+c0Ds9Q1ZbGGnf+s0U3iHLtTvbUjHj+ozckaPbf+Gds1gNJ1ufiGpFVqy6Iy0Zw6kyZmqROy5eeULWXRI2jMHTnUlQmTTSBYdajxT+GD07SuJTSNZbCGCZw4URVp5TSNZ7FmrZw7Y04p0dPNKkNbzGFxu00gWOzjMTOG+hz3FmEpY00gWWyjXdLrNFG6qetXj/9IdtaymkSz2zMrrvhja04p0QdK+F+diZpHuqCUVvsnicdJNY9ZPJ2fqf8Azsnistk1jOr97znd4RxaPVnHVmJqTJ/whi0fMWLghAVmUQ49kWo68h/p6r+23u1i3zL0X5HmOAaSgXYQUZBFSkEVIQRYhBVmEFGQRUpBFSEEWIQVZhBRkEVKQRciQJP8A/URylqZJTvYAAAAASUVORK5CYII="
},
"tpr-formula.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAPQAAABSCAIAAADsCG5aAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAANCSURBVHhe7drtleIwDEBR6qIg6qEampliGD5MLDuy4xhDFPHuvw0WYXMeWWdmD1fAKeKGW8QNt4gbbhE33CJuuEXccIu44RZxwy3ihlvEDbeIG24RN9wi7i2dDi1Ol/vaS2Hx8fz3eC/MEPeWVsT9dz6GP2ooXEPcWxoW983z/g6BuM0Q+w4lVBF3fDUpnrpzxG1GR9zJYbYmOeI2oytuMUXcOeI2gzv3aMRtRkfcYoS254jbjOa4VcrIzyNuM96Im7u2irjN6IubrsuI24yOPTeqiNsM4h6NuM0g7tGI2wziHo24zSDu0YjbjJiv9hMQ4l6PuOEWccMt4oZb/XHHx5+a5/5QPCsl5pvL0kp2mljtK3GLhyGFLLy+ksKxhoG4b6ZmF1eSN5oN2nN3/Iw26fh1WF2ZblaoG422izs5/NqaFFbKE/Df4NBoy7iVYksri9FLYlFBeTYXBlAVLpZVu7hzy2orN+5hcYfVaBAumUnbxS1GYrFLfVb3JMS9gXDJTPp23Ko4UltZDXuocEI0CJfMpO3jTpq1EfdNOCeqwsWyatO457mKleF95OxX88bufTtu5VVJXVl/86j0HYoWzg5f9hC3fPfazZu4kdhF3PJwpW7iRmIfcSfdUija7CRu6sZ6g+KO7WnbhgFxL31/gJlBcQP2EDfcIm64Rdyt4pa/5vk4IJ4PEp/8Fat85M5N500XKc8u0yf/5Ef9FuJutSLuWmcfy6Z60lfG+bduljdx/6Rhcd8ot8y3Nd25Z/+k5J+EuCEiUUIVncVXk/ha6p4G2jpTT5p7fe7j6aS/OXGjI+7kcEs6n4z7cDpPy5PVxI2uuMXU9nFf9I6JG11xi8MW4hYTcYC40RG3GGkrpz/umelTJHFrKRM3muNWKSMPS3MPPcOluMVQOEDceCPuSjVNcZfmK8NxJI9bjD0XETf64l4Kpilu5XwPYri0RIs7O0bcaI673Nmi7J66pDvuOHk7SNxwFXc8fDxf1p3UNuLu4ixuca4jcf86b3Enf6MH4v5Z/uKW83fE/bNiCVoEY+Je6d2407w9tE3c8Iu44RZxwy3ihlvEDbeIG24RN9wibrhF3HCLuOEWccMt4oZT1+s//0PZRPTTOpwAAAAASUVORK5CYII="
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC Curve\n",
"\n",
"\n",
"Another tool to measure the classification model performance visually is **ROC Curve**. ROC Curve stands for **Receiver Operating Characteristic Curve**. \n",
"\n",
"\n",
"The **ROC Curve** is plotting the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold levels.\n",
"\n",
"\n",
"**True Positive Rate (TPR)** or **Recall** is defined as\n",
"\n",
"\n",
"![tpr-formula.png](attachment:tpr-formula.png)\n",
"\n",
"\n",
"It is also called **Recall**.\n",
"\n",
"\n",
"\n",
"\n",
"**False Positive Rate (FPR)** is defined as\n",
"\n",
"\n",
"![fpr-formula.png](attachment:fpr-formula.png)\n",
"\n",
"\n",
"\n",
"The **Receiver Operating Characteristic Area Under Curve (ROC AUC)** is the area under the ROC curve. The higher it is, the better the model is. \n",
"\n",
"\n",
"In the ROC Curve, we will focus on the TPR (True Positive Rate) and FPR (False Positive Rate) of a single point. This will give us the general performance of the ROC curve which consists of the TPR and FPR at various probability thresholds."
]
},
{
"attachments": {
"precision-formula.png": {
"image/png": ""
},
"recall-formula.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAMkAAABBCAIAAAD0RxfnAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAABOWSURBVHhe7Z15VFVlF8bJoVLCEbFScSoxzVJxmRJiDohprnQZqKhpjsUgTmlACFLOaJBDKSrOpoRDDmiKmrNlqKCS4hSUgWKKgppZfD/Z13fdT7yXywx2nj/e9c7nPWc/59l733tRswwNGgoGGrc0FBQ0bmkoKGjc0lBQ0LiloaCgcau4459//qG8e/fumjVrKlasmJiYSPOvv/7KHCzW0LhV3PH3339TxsbGenh4WFpaatzSkA/4999/79+/n56eDpPCwsKaNGnSvHnzhIQElOzevXtMoMIE6qJt1KUJHaWnaKFxq2Rg6dKljRs3njBhgq5dEqBxq/gC3bp9+3ZycnJqampQUFDDhg2XLFmSlJR0584dlIlRJOrWrVuMit9E4WjeuHEjLS0NAZNNihAat4ojhBmQBrlydna2tbWtXLlyhQoVunbt2rZt2/Dw8GvXrkGjDRs2TJo06b333lu2bBnzZ8+eHRAQ0LdvXz8/v23btmXupEsFigQat4oj0CRKZOnixYsHDhzw9fWFUsRb1KOiouLj4xGq8+fPjxo1Kjo6unfv3tbW1p988snZs2ePHTt26NAhHx8fFxeXlJQU2Up2K3xo3CoBCA0N7dGjxwcffKBrZ2TgK+Pi4nCU1KdPn04o5u7ujiuUUVwnRIyMjBT9Kyrp0rhVfIHeCDkGDBjQoUMHXB4JI4Arf/75Jx4ToWLU1dV1+PDhKBx1mT9//vxGjRqhavhN1Vn40LhVAtCrV68+ffqcOHFCmvo+7sKFC02bNh00aBB11R8cHFyjRo0xY8aIkkmkX/jQuFVMoRzZmTNncHkokzQFMkr59ddfv/HGGwRb0g9QqfHjx1erVo0AH5GjR/OJGv4P4sgohT0jR46kqRRIJCo9PZ2ssE2bNrNmzZJ+Juzdu5fMkSXkkkwrKmIBg9ziWBxUHLyAtEWBJqNFdW7OxkOXU6lggsPI2YrwYPkIoRG3g79zcHDw9/eX/lWrVp0+fVrqoHnz5kOGDNm9e7e6a29vb4Iz4SJQz6fwkVfdegKsWDwh3OLlsbe3R4e2bNlCMyEhgQh9//79mVMybty4UatWrcGDByu2EX5BLGdn58OHD8sORWggY7p18+bNq1evIq2ClJQU0hOpX7lyBUHWTS108Lzu3LnDYXi4VGhy2rt373IwOsnP5bGWaKBYlNxUu3btEKFdu3YlJydv3LiRiP769esMccvHjh2rWrUqPnHdunU8Cozi5eXl4uKCG2VCkb/2BrnFDXBQCwsLKysrbqBmzZp2dnadOnVydHRs27ZtgwYNiB/1xblwwAOl/P3335cvX06a3axZMzLzS5cuofzbt2/nFbexsZkwYYJ6s0suxJdR4gTxeq1btw4MDKSuXmleexwlBvL09Jw3b97AgQOxCHOOHDmC2hUHf2KQW5wvOjp6xowZ3bp1K1269Mcff7xp06Y9e/bs2LEjKioKiXZycqpbt+7MmTN1CwoRaNWvv/66du3aChUq8ALExsbCOQjHy/3yyy9z4M2bN8tM4WKJRlpaWkxMDE/+1KlTSUlJut6MjLi4uLfeesvW1ha78Hbt27ePafhEueVizS0ByhwQEGBubh4REaHreoht27ZBrxYtWiDUcieFcz+KLpcvX65YsSKhLtySHtCqVSvEVX2b9kRCpIvYq3bt2oRWcE76FYowfteHQW5BFKIW3gMUq1SpUrgeQhmohmYgaZTMmTNnTrly5Vq2bCmf/6ooB/MDdlCl9CtkjuuGpKIbeAjpF2TdgYiKksvhrEeMGHHy5EnpB7zKpOWRkZHSfGRhSQS3z2PnmXO/kEbd0fHjx5FtvGFqaipNLMIEUHxu2Ri3KHk5xo0bV6ZMGcLDrF8gIGkMEe
LI7XFj0l/Q4EFT4hYrVark5uamuMVjJS4hmH2SuPUIuCNuH+lasWKFmZkZwQkW0edc8UH23Bo/fjzx1ldffUUmQic3JkPcD0PcnqurqxIS7tDQTcoqkO1TUDMfgVpI9kSZlVsgW26xeebrbSoMHaYIQVg5evRo4gEePgnNN998Q6ypGytOyIZbiYmJ+ES4RYYi/Qo4oyZNmnTv3v3cuXMPCPXwi1VAbjxlyhRvb29KpuFPRfMkrxYQbkNKcjofHx8SHPmqVeHMmTOEqGQSQUFBZEArV678448/dGN6ulW5cmV3d/dHuNWrV68nWLcAEf3Ro0d/+OEHkip4RiAv73ZxQzbcwn5+fn5ly5bt169fWFgYYfvSpUvpAXDi888///nnn9V8gKKQXQ4fPhzShIaGwonJkyd37dqVOqGbkO/mzZu+vr4LFiyAecuWLVuyZAn8Cw8P3759u2wFWZFJCEcmyEsJwxAn/K+il0R7ueMW9uBIo0aNGjlyJKURjBkzhjNwMN1KA+AScL0wobvw/4N+CcsKB7qrGkX23PL394dbjo6O0CUkJOTTTz9t3779U089hbHxkswRNZIyPj4euWrTpo36LREYO3Zsz549Fy1aRB0Bg6NwIjAwUEYBTOrcuTMSKM3FixfDzqFDh0oTELHa29svXLhQTiWJUu64tXz5cpymnZ0dGSWTjcDBwYE7hdm6lRpyiOx9IrE8PhFCSI8AyXn66ac9PT3VJy6iJYSWJC/Tpk1j4fWH2LdvH7mkGAlCVK9eHart3buX5oMo9P59JBBu4XwfbJSRgVz16NEDrWKI5BRMnToVS/fp00cYLB42d9zKdxCT8RD+a9DdvFGYyi1cmDh1sRa+r0OHDvXq1UOTHszOBCafM2fOM888g3VRLxYyilv57LPP3nnnHZwpcwgOUMFhw4ahcDTlKogcEqg+cZYvbXCdBw4c+PDDD9kB5j3//PPdunUrVtwS13Dq1CnuCJQvX14q/wXIEzCObLilnydibEwlORrAZ1laWhKHSROcPXsWJXvuuefWrFnDEz9+/DhB/YkTJ6j/+OOPkJ3lq1evLlWqFPHW5cuXWaKvhfog9urSpQt+E3qxA+Fds2bN8FDq6oCzFS235PCpqakILSAyk8p/AfIEjCN73YJbsEG4RY8oB97Kw8PD3NwcBkg/88+fP09MRidOkJ6s4EUnu0HYUDVJm8X2qFTmuA4RERE4RBsbm/Xr10sPEf3bb7+NW+S6bCJ/ZUBqmQtu5Xssr8EQjHELw+B35LPT+fPni24JtwDcgnNOTk6nT5/G5LKEaYgc+ePVq1fpwdmJ44iNjUVmqPz2229wq3///vI9N7sJiQmqZAno1KlTx44dhViyfO7cuRCLTg5AoHbw4EEqeNVKlSp99NFHMTExwiG2Em4R9sm2WZHvsTyXRk3/a9DdvFFko1uk/QTa+vGW4hYOi6i8UaNG69atU/07d+4kMMIk8mcCIkg8fS8vL3agfu3atfr169va2n777bc0FVi4YsUKqWPUvn37HjlyRJosJ3Vo2LAh0kUTzZOUE95XqVJlxIgRv/zyS+bEB3jzzTcJ+b///ntpCuc0FAmy4RZyQiSO0oSEhEgmqPwXwtO7d28CWAxPc/PmzcgJoRXzn3322e+++06mAZzXypUrhUzo0KZNm6pWraqyQgFulKBK6u3atYMiRGbSRM9mzZr10ksv4X9pEqsJt/CMVlZW7APJMic+AGsHDBggSSjQuPVY8FgA3kZgSOPzCIPc4tp4FjxOnTp1zMzMMC2RjTgj6MWBqKxatcrZ2blp06bEZDADNqBtBOkoEDENrmf69On4FPoJ6qGp8JISqSOAw3lNmjQJokBEIqRz587JpUkFcLgkoUQ806ZNQ4TwesyvXbs2CeaGDRv2798fFhbm6OiIU65VqxZEnDdv3o4dOwYOHIj7hnBt2rRBVjkku0mpwLF5lKaD+bqVpoEluIycrnoiYZBbgOQuNDQU80+cOB
E1oi5BkmL6lStXCI0hEPxAt/R9E9E3C0kokCuoI2GTvpkPHz7MtrBz27ZtW7duJQ/QDWRO27NnD+TAjbIDIoSp6ORCyCcqRWoG4Yjq/P39AwICKLl6dHT0zJkzoTjXhdkopRxSSg0KPEzAq75kyRIyFQxx9OhR3Vi+whi3jIPz6Wp6oFM5zUegbGzI2IYWlixcv3599+7d+p8F5iPy/p7IQybqXbZsGYmXq6sr4engwYNlNH/fQ2Pc4kqoBacRKLlS4NnJBEr9Uf1+OkHWp0ynoQmGtpUeRgGdNBVkB0Z17UL//YJc7tatWxEREeXKlZNPhk3MpwoT4kCIH8zNzZOTk7/44gtra2v1x488Q6nkC3KvWxr0AZsp8S8oQeXKlSV2LG7c4t2jJN8aOXJktWrVqKOvcXFx8gllvkPjVl4hepmeng6TcDSNGjUit7hw4QJKpqJMcO/ePUr9JnSUnmwhn+/Exsb2799ffgwiVM4pyPTZauPGjfjBxo0bZ92EM8vZ5OTSZBqgJ6eqpnErP4HZbGxsgoODde0syEomU3y3WHrXrl1k6+JtpSd3ID0ijyYH17UNMPWxnTmCxq3cQ95jbHDx4kWSWXLbcePG1apViyCGNJZ8Fo/DnKtXr5Jfk4uRVjMfd4lXOnToEM5IfqQEjEuCMOngwYOvvPLKpUuXqIuSmQjZHFpzhgMHDri7u9etW9fNzY0MPSEhQb74l2kkjxyPCx07dowe8hKaMTEx5OAszCmhNW7lHiJC2IZkHi/z6quvVqlSpXz58k5OTm3btl21ahWsgj04ysmTJ3fp0sXb2zs1NXXhwoWBgYH9+vXz8vIKDQ0Vw5vCrf3799evX19+oJsjM8s5cdlQv1evXkRaZBs4bgcHh/DwcPnRgGDlypVTp07t2bPna6+9xpsQGRk5d+5cPz8/T0/Phg0b/vTTT7p5pkHjVl6BU0Of4BAM69atW8uWLXndSfIpGcWFDRo0CDvhgypUqODi4nLy5ElGmT9z5kx8k3w5JvtIJSvyyC0AdwExO+eB0+3ateMAHCMtLY3rKmZDehSXObwh9vb2GzZsQHFZlZiYaGdnN3DgQPnDz6ye/bHQuJVvCAoKQpyGDRuma2f+epugfs2aNdRDQkJatWqF8RQtli5d2qBBg0WLFsmHYfoGk9gZpRGwD+Xu3bvr1at35swZ1SNgZraxkSLu8ePHhwwZ0rVrV2kqSIy/Y8cO2AaJzczMkCuuJaPsz2vg6OiIsNHUuFVIwGzyrEeMGNGhQwe4gp0wBp3YCUhmhzPCqAQ0mYseAM6hQ0OHDpXObEMoQh8ShWxp9FiIMnGkL7/8Eoq8++67MJIrAqEd2xICJiUlUVm3bp25uTkBlr6Ucun3338fflPX7zcCjVt5gvImPG7Y07t3b/0vrxTwgwQ3rq6u1NWSefPmEfogDxJN63OLYB8JQUgE8r3ZlClTXnjhhQULFtATEREhQ/Rv2bJFvoszArko5ahRo9q3bz969Gjpz6pAaBVx4YsvvihN3g1KAkdCSQ6ckpLCJhq3CgPKZiRWLVq0gFvSLxAbUM6aNQtujR8/XvpBeno6rLKwsFi/fr1Ikb6Zif1bt25do0YNbAygVM2aNS0tLcuWLWtlZUU/PTJkbW2NY502bRqr5DCPhWKDs7Nz586dySekqb+EY9DkPHhM7kX9nA6QOZYuXVr/ty2mQONWniCEwCQSmJPeSyeRsoqrsBkWxV2qz72wNGKDCSGQyNUjSoBaoBDJyckkAYBQmnLTpk116tSRzzKkhwkAEqjPEQxB7Q/7kU/cnzT1ryvcmjFjhq2trYeHhygWIGQcMGBA9erVkVLpMREat/IEJTbkUCRfEx7+nyj+/v7qN2SgWbNmbm5uWFTF3WPHju3YsSPSJROy+qasiI2NRaJgra5tMpQ4sRZBUg4RPMItSnwuEqv/w10Cr7p16/J6UI+KigoLC5P+bKFxK09QnECBCHWFT+RihMxKG9AVbNO9e3
f1UxYSMYiFhJw9e1YpnwwJ6ETPIKIAWaLcs2cPeSJLVI+CcWrK5pRcF27Jn2Y9ckVAD9v6+PgQkMlfJAjbuKny5cvLr4spvby8Hsw2ARq38gR5+lilT58+Y8aMmTt3LgTauXNnQkKC/EkcE6Kjo3EoqBop5JEjR2hiXZLKR37VbQSwhzKPn51C1pCQEN4B+Yd39emoeIbXI5CX344DuQU4jV6SPdy8eTMyMnLr1q1ZeflYaNzKE5Tj2LdvHzGKvb09bnH16tXqby2JigjC5DfcixcvRtuI6Am95aMjEymSlVv6SaWJuH37No6bQJ4XgKa+N1Rcge6+vr5Il4zezfwdB0nrsEygWMR89BiXSQWNW/kDjIFDuZb5XziJSUTS4uPjW7ZsiWghZnfu3IFqvP3yPWPmOpMg3CLuQT9yza3U1NTGjRsTbIlr1ueWAifnrVAvhgAmEdezHORILzVuFRTEQgRblpaWffv2jYmJkX4F0+klFsU3NW3aVD4/y9bGanMOQF2+gHr99dfVn47miNz6MH2hxq38ATKAUCEnvOVAGSAuLq5MmTLh4eHKyzCNCY+VDUNgPmViYuJXmf8KmuoxArV/QEBAUFDQ2rVrUVA7OzsISqeh5XIXQNfOBPcinSBHx9a4VSDAHqRaRCqEVmZmZoTPSUlJOXWFeYG6kIuLi42NDTEfZwgODpaf6MASGS1QaNzKZyijRkVFEd1bWFjALSsrqxUrVohdc00vFprOCSUws2fPnjhxIifp2LGj/IVpLmK13EHjloaCgsYtDQUFjVsaCgoatzQUFDRuaSgYZGT8D836qDXIJPz1AAAAAElFTkSuQmCC"
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision - Recall Curve\n",
"\n",
"\n",
"\n",
"Another tool to measure the classification model performance is **Precision-Recall Curve**. It is a useful metric which is used to evaluate a classifier model performance when classes are very imbalanced such as in this case. This **Precision-Recall Curve** shows the trade off between precision and recall.\n",
"\n",
"\n",
"\n",
"In a **Precision-Recall Curve**, we plot **Precision** against **Recall**.\n",
"\n",
"\n",
"Precision is defined as :\n",
"\n",
"\n",
"\n",
"![precision-formula.png](attachment:precision-formula.png)\n",
"\n",
"\n",
"\n",
"Recall is defined as :\n",
"\n",
"\n",
"\n",
"![recall-formula.png](attachment:recall-formula.png)\n",
"\n",
"\n",
"\n",
"The **Precision Recall Area Under Curve (PR AUC)** is the area under the PR curve. The higher it is, the better the model is."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Difference between ROC AUC and PR AUC\n",
"\n",
"\n",
"- Precision-Recall does not account for True Negatives (TN) unlike ROC AUC (TN is not a component of either Precision or Recall). \n",
"\n",
"\n",
"- In the cases of class imbalance problem, we have many more negatives than positives. The Precision-Recall curve much better illustrates the difference between algorithms in the class imbalance problem cases where there are lot more negative examples than the positive examples. In these cases of class imbalances, we should use Precision-Recall Curve (PR AUC), otherwise we should use ROC AUC.\n",
"\n",
"\n",
"So, we can conclude that we should use PR AUC for cases where the class imbalance problem occurs. Otherwise, we should use ROC AUC.\n"
]
},
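{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why, consider a worked example with hypothetical counts (chosen for illustration, not taken from this dataset): 100,000 negatives and 100 positives, with a classifier that yields $TP = 80$, $FN = 20$, $FP = 400$ and $TN = 99{,}600$. Then\n",
"\n",
"$$\\text{TPR} = \\frac{TP}{TP + FN} = \\frac{80}{100} = 0.80, \\qquad \\text{FPR} = \\frac{FP}{FP + TN} = \\frac{400}{100{,}000} = 0.004$$\n",
"\n",
"$$\\text{Precision} = \\frac{TP}{TP + FP} = \\frac{80}{480} \\approx 0.17$$\n",
"\n",
"The ROC point (FPR = 0.004, TPR = 0.80) looks excellent because the 400 false positives are divided by a very large number of negatives. Precision, which ignores the true negatives, shows that only about one flagged case in six is actually positive."
]
},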
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Precision - Recall Curve \n",
"\n",
"\n",
"In the previous section, we conclude that we should use `Precision-Recall Area Under Curve` for cases where the class imbalance problem exists. Otherwise, we should use `ROC-AUC (Receiver Operating Characteristic Area Under Curve)`.\n",
"\n",
"\n",
"Now, I will compute the `average precision score`. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average precision-recall score : 0.47\n"
]
}
],
"source": [
"# compute and print average precision score\n",
"\n",
"from sklearn.metrics import average_precision_score\n",
"\n",
"average_precision = average_precision_score(y_pred, y)\n",
"\n",
"print('Average precision-recall score : {0:0.2f}'.format(average_precision))"
]
},
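{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the average precision reported by scikit-learn is the sum $\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. The sketch below is a minimal NumPy illustration of that sum, not scikit-learn's implementation. The function name `average_precision_sketch` and the toy labels and scores are assumptions made for illustration, and the sketch assumes that there are no tied scores.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def average_precision_sketch(y_true, y_score):\n",
"    # Rank the samples from the highest score to the lowest.\n",
"    y_true = np.asarray(y_true)\n",
"    order = np.argsort(-np.asarray(y_score, dtype=float), kind='stable')\n",
"    y_sorted = y_true[order]\n",
"\n",
"    # Cumulative true positives when the top-k samples are predicted positive.\n",
"    tp = np.cumsum(y_sorted)\n",
"    k = np.arange(1, len(y_sorted) + 1)\n",
"\n",
"    precision = tp / k                                   # P_n\n",
"    recall = tp / y_sorted.sum()                         # R_n\n",
"    recall_prev = np.concatenate(([0.0], recall[:-1]))   # R_{n-1}\n",
"\n",
"    # AP = sum_n (R_n - R_{n-1}) * P_n\n",
"    return float(np.sum((recall - recall_prev) * precision))\n",
"\n",
"\n",
"# Hypothetical toy data: 1 marks the positive (minority) class.\n",
"y_toy = [0, 0, 1, 1]\n",
"score_toy = [0.1, 0.4, 0.35, 0.8]\n",
"ap_toy = average_precision_sketch(y_toy, score_toy)\n",
"print('Average precision : {0:0.2f}'.format(ap_toy))\n",
"```\n",
"\n",
"Continuous scores, such as the output of `predict_proba`, give the curve many thresholds to sweep over, whereas hard 0/1 predictions collapse it to a single operating point."
]
},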
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Precision-Recall Curve` gives us the correct accuracy in this imbalanced dataset case. We can see that we have a very poor accuracy for the model.\n",
"\n",
"\n",
"Now, I will plot the `precision-recall curve`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x6b5ed6a940>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from sklearn.metrics import precision_recall_curve \n",
"\n",
"precision, recall, thresholds = precision_recall_curve(y_pred, y)\n",
"\n",
"# create plot\n",
"plt.plot(precision, recall, label='Precision-recall curve')\n",
"plt.xlabel('Precision')\n",
"plt.ylabel('Recall')\n",
"plt.title('Precision-recall curve')\n",
"plt.legend(loc=\"lower left\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Random over-sampling the minority class\n",
"\n",
"\n",
"\n",
"**Over-sampling** is the process of randomly duplicating observations from the minority class in order to achieve a balanced dataset. So, it replicates the observations from minority class to balance the data. It is also known as **upsampling**. It may result in overfitting due to duplication of data points. \n",
"\n",
"\n",
"The most common way of over-sampling is to resample with replacement. I will proceed as follows:-\n",
"\n",
"\n",
"First, I will import the resampling module from Scikit-Learn."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# import resample module \n",
"\n",
"from sklearn.utils import resample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will create a new dataframe with an oversampled minority class as follows:-\n",
"\n",
"\n",
"1. At first, I will separate observations from Class variable into different DataFrames.\n",
"\n",
"\n",
"2. Now, I will resample the minority class with replacement. I will set the number of samples of minority class to match \n",
" that of the majority class.\n",
"\n",
"\n",
"3. Finally, I will combine the oversampled minority class DataFrame with the original majority class DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# separate the minority and majority classes\n",
"df_majority = df[df['Class']==0]\n",
"df_minority = df[df['Class']==1]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# oversample minority class\n",
"\n",
"df_minority_oversampled = resample(df_minority, replace=True, n_samples=284315, random_state=0) "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# combine majority class with oversampled minority class\n",
"\n",
"df_oversampled = pd.concat([df_majority, df_minority_oversampled])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 284315\n",
"0 284315\n",
"Name: Class, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display new class value counts\n",
"\n",
"df_oversampled['Class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that we have a balanced dataset. The ratio of the two class labels is now 1:1.\n",
"\n",
"Now, I will plot the bar plot of the above two classes."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x6b5f3a2a90>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAC1BJREFUeJzt3VGInflZx/HvbxPihS29MIPUJNMJNiBRi8Ux61UVXTGhkAiukIDQlcpQaKhaL5qihDbe6Ar2KheNuFKEmq69GutIwGovRLZmVpdKNqQdwmqGgKZ2WRGxadzHi0zbw+lJ5j0zZzKbJ98PBM7/ff+c82wYvrx555yzqSokSb08tdsDSJJmz7hLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWpo72698P79+2thYWG3Xl6SHksvv/zy16tqbrN9uxb3hYUFVldXd+vlJemxlORfh+zztowkNWTcJakh4y5JDRl3SWrIuEtSQ4PinuR4khtJ1pKcm3D+uSR3kryy8efXZz+qJGmoTd8KmWQPcBH4BWAduJpkuapeHdv6uao6uwMzSpKmNOTK/RiwVlU3q+oucBk4tbNjSZK2Y8iHmA4At0bW68DTE/b9cpL3AV8Ffquqbo1vSLIELAHMz89PP+0uWDj3V7s9Qiuv/f77d3uEPj7xjt2eoJdPvLHbE8zUkCv3TDg2/n/V/ktgoareA/wN8JlJT1RVl6pqsaoW5+Y2/fSsJGmLhsR9HTg0sj4I3B7dUFX/WVXf3Fj+MfCTsxlPkrQVQ+J+FTiS5HCSfcBpYHl0Q5J3jixPAtdnN6IkaVqb3nOvqntJzgJXgD3AC1V1LckFYLWqloGPJDkJ3AO+ATy3gzNLkjYx6Fshq2oFWBk7dn7k8ceBj892NEnSVvkJVUlqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDQ2Ke5LjSW4kWUty7iH7nk1SSRZnN6IkaVqbxj3JHuAicAI4CpxJcnTCvrcDHwG+POshJUnTGXLlfgxYq6qbVXUXuAycmrDv94Dngf+d4XySpC0YEvcDwK2R9frGse9I8l7gUFV9YYazSZK2aEjcM+FYfedk8hTwKeC3N32iZCnJapLVO3fuDJ9SkjSVIXFfBw6NrA8Ct0fWbwd+DPhSkteAnwaWJ/1StaouVdViVS3Ozc1tfWpJ0kMNiftV4EiSw0n2AaeB5W+frKo3qmp/VS1U1QLwEnCyqlZ3ZGJJ0qY2jXtV3QPOAleA68CLVXUtyYUkJ3d6QEnS9PYO2VRVK8DK2LHzD9j7s9sfS5K0HX5CVZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8ZdkhoaFPckx5PcSLKW5NyE8x9K8i9JXkny90mOzn5USdJQm8Y9yR7gInACOAqcmRDvz1bVj1fVTwDPA38080klSYMNuXI/BqxV1c2qugtcBk6Nbqiq/xpZfj9QsxtRkjStvQP2HABujazXgafHNyX5MPBRYB/wc5OeKMkSsAQwPz8/7aySpIGGXLlnwrHvuTKvqotV9cPAx4DfnfREVXWpqharanFubm66SSVJgw2J+zpwaGR9ELj9kP2XgV/azlCSpO0ZEverwJEkh5PsA04Dy6MbkhwZWb4f+NrsRpQkTWvTe+5VdS/JWeAKsAd4oa
quJbkArFbVMnA2yTPAt4DXgQ/s5NCSpIcb8gtVqmoFWBk7dn7k8W/MeC5J0jb4CVVJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0NinuS40luJFlLcm7C+Y8meTXJV5J8Mcm7Zj+qJGmoTeOeZA9wETgBHAXOJDk6tu2fgcWqeg/weeD5WQ8qSRpuyJX7MWCtqm5W1V3gMnBqdENV/V1V/c/G8iXg4GzHlCRNY0jcDwC3RtbrG8ce5IPAX29nKEnS9uwdsCcTjtXEjcmvAovAzzzg/BKwBDA/Pz9wREnStIZcua8Dh0bWB4Hb45uSPAP8DnCyqr456Ymq6lJVLVbV4tzc3FbmlSQNMCTuV4EjSQ4n2QecBpZHNyR5L/Bp7of9P2Y/piRpGpvGvaruAWeBK8B14MWqupbkQpKTG9v+EHgb8BdJXkmy/ICnkyQ9AkPuuVNVK8DK2LHzI4+fmfFckqRt8BOqktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1NCguCc5nuRGkrUk5yacf1+Sf0pyL8mzsx9TkjSNTeOeZA9wETgBHAXOJDk6tu3fgOeAz856QEnS9PYO2HMMWKuqmwBJLgOngFe/vaGqXts49+YOzChJmtKQ2zIHgFsj6/WNY1NLspRkNcnqnTt3tvIUkqQBhsQ9E47VVl6sqi5V1WJVLc7NzW3lKSRJAwyJ+zpwaGR9ELi9M+NIkmZhSNyvAkeSHE6yDzgNLO/sWJKk7dg07lV1DzgLXAGuAy9W1bUkF5KcBEjyU0nWgV8BPp3k2k4OLUl6uCHvlqGqVoCVsWPnRx5f5f7tGknSW4CfUJWkhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1JBxl6SGjLskNWTcJakh4y5JDRl3SWrIuEtSQ8Zdkhoy7pLUkHGXpIaMuyQ1ZNwlqSHjLkkNGXdJasi4S1JDxl2SGjLuktSQcZekhoy7JDVk3CWpIeMuSQ0Zd0lqyLhLUkPGXZIaMu6S1NCguCc5nuRGkrUk5yac/74kn9s4/+UkC7MeVJI03KZxT7IHuAicAI4CZ5IcHdv2QeD1qno38CngD2Y9qCRpuCFX7seAtaq6WVV3gcvAqbE9p4DPbDz+PPDzSTK7MSVJ09g7YM8B4NbIeh14+kF7qupekjeAHwC+PropyRKwtLH87yQ3tjK0JtrP2N/3W1H8N92T6LH42eSTj8316LuGbBoS90n/xbWFPVTVJeDSgNfUlJKsVtXibs8hjfNnc3cMuS2zDhwaWR8Ebj9oT5K9wDuAb8xiQEnS9IbE/SpwJMnhJPuA08Dy2J5l4AMbj58F/raqvufKXZL0aGx6W2bjHvpZ4AqwB3ihqq4luQCsVtUy8CfAnyVZ4/4V++mdHFoTebtLb1X+bO6CeIEtSf34CVVJasi4S1JDxl2SGhryPndJGizJj3D/U+sHuP95l9vAclVd39XBnjBeuUuamSQf4/5XlAT4R+6/lTrAn0/60kHtHN8t00ySX6uqP93tOfRkSvJV4Eer6ltjx/cB16rqyO5M9uTxyr2fT+72AHqivQn80ITj79w4p0fEe+6PoSRfedAp4Acf5SzSmN8Evpjka3z3CwfngXcDZ3dtqieQt2UeQ0n+HfhF4PXxU8A/VNWkKyfpkUjyFPe/KvwA938m14GrVfV/uzrYE8Yr98fTF4
C3VdUr4yeSfOnRjyN9V1W9Cby023M86bxyl6SG/IWqJDVk3CWpIeMuSQ0Zd0lq6P8BshdYHBwkBd8AAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# view the distribution of percentages within the Class column\n",
"\n",
"\n",
"(df_oversampled['Class'].value_counts()/np.float(len(df_oversampled))).plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above bar plot shows that we have a balanced dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will train another model using Logistic Regression and check its accuracy, but this time on the balanced dataset."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 93.76%\n"
]
}
],
"source": [
"# declare feature vector and target variable\n",
"X1 = df_oversampled.drop(['Class'], axis=1)\n",
"y1 = df_oversampled['Class']\n",
"\n",
"\n",
"# instantiate the Logistic Regression classifier\n",
"logreg1 = LogisticRegression()\n",
"\n",
"\n",
"# fit the classifier to the imbalanced data\n",
"clf1 = logreg1.fit(X1, y1)\n",
"\n",
"\n",
"# predict on the training data\n",
"y1_pred = clf1.predict(X1)\n",
"\n",
"\n",
"# print the accuracy\n",
"accuracy1 = accuracy_score(y1_pred, y1)\n",
"\n",
"print(\"Accuracy : %.2f%%\" % (accuracy1 * 100.0))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a balanced dataset. Although the accuracy is slightly decreased, but it is still quite high and acceptable. \n",
"This accuracy is more meaningful as a performance metric."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Random under-sampling the majority class\n",
"\n",
"\n",
"The **under-sampling** methods work with the majority class. In these methods, we randomly eliminate instances of the majority class. It reduces the number of observations from majority class to make the dataset balanced. This method is applicable when the dataset is huge and reducing the number of training samples make the dataset balanced.\n",
"\n",
"\n",
"The most common technique for under-sampling is resampling without replacement.\n",
"\n",
"\n",
"I will proceed exactly as in the case of random over-sampling."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# separate the minority and majority classes\n",
"df_majority = df[df['Class']==0]\n",
"df_minority = df[df['Class']==1]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# undersample majority class\n",
"\n",
"df_majority_undersampled = resample(df_majority, replace=True, n_samples=492, random_state=0) "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# combine majority class with oversampled minority class\n",
"\n",
"df_undersampled = pd.concat([df_minority, df_majority_undersampled])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 492\n",
"0 492\n",
"Name: Class, dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display new class value counts\n",
"\n",
"df_undersampled['Class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that the new dataframe `df_undersampled` has fewer observations than the original one `df` and the ratio of the two classes is now 1:1.\n",
"\n",
"Again, I will train a model using Logistic Regression classifier."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 93.90%\n"
]
}
],
"source": [
"# declare feature vector and target variable\n",
"X2 = df_undersampled.drop(['Class'], axis=1)\n",
"y2 = df_undersampled['Class']\n",
"\n",
"\n",
"# instantiate the Logistic Regression classifier\n",
"logreg2 = LogisticRegression()\n",
"\n",
"\n",
"# fit the classifier to the imbalanced data\n",
"clf2 = logreg2.fit(X2, y2)\n",
"\n",
"\n",
"# predict on the training data\n",
"y2_pred = clf2.predict(X2)\n",
"\n",
"\n",
"# print the accuracy\n",
"accuracy2 = accuracy_score(y2_pred, y2)\n",
"\n",
"print(\"Accuracy : %.2f%%\" % (accuracy2 * 100.0))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, we can see that we have a slightly decreased accuracy but it is more meaningful now."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Apply Tree-Based Algorithms"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# declare input features (X) and target variable (y)\n",
"X4 = df.drop('Class', axis=1)\n",
"y4 = df['Class']\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# import Random Forest classifier\n",
"from sklearn.ensemble import RandomForestClassifier\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# instantiate the classifier \n",
"clf4 = RandomForestClassifier()\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,\n",
" oob_score=False, random_state=None, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fit the classifier to the training data\n",
"clf4.fit(X4, y4)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# predict on training set\n",
"y4_pred = clf4.predict(X4)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 99.99%\n"
]
}
],
"source": [
"# compute and print accuracy (y_true first, then y_pred)\n",
"accuracy4 = accuracy_score(y4, y4_pred)\n",
"print(\"Accuracy : %.2f%%\" % (accuracy4 * 100.0))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ROC-AUC : 0.9999990706517691\n"
]
}
],
"source": [
"# compute and print ROC-AUC\n",
"\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"# probability of the positive class (column 1 of predict_proba)\n",
"y4_prob = clf4.predict_proba(X4)[:, 1]\n",
"print(\"ROC-AUC : \" , roc_auc_score(y4, y4_prob))"
]
},
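{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a complement, the following is a minimal sketch of scoring a random forest on a stratified hold-out split, so that the metrics come from rows the model has not seen. It uses synthetic data from `make_classification` rather than the notebook's `df`; the names `X_demo`, `y_demo` and `rf_demo` and all parameter values are illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch only: synthetic data, not the dataset used elsewhere in this notebook\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import average_precision_score, classification_report\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# two-class problem with roughly 2% positives (illustrative assumption)\n",
"X_demo, y_demo = make_classification(n_samples=5000, n_features=10,\n",
"                                     weights=[0.98, 0.02], random_state=0)\n",
"\n",
"# stratify so that both splits keep the same class ratio\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,\n",
"                                          stratify=y_demo, random_state=0)\n",
"\n",
"rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"rf_demo.fit(X_tr, y_tr)\n",
"\n",
"# per-class precision and recall on the held-out rows\n",
"print(classification_report(y_te, rf_demo.predict(X_te)))\n",
"\n",
"# area under the precision-recall curve on the held-out rows\n",
"ap = average_precision_score(y_te, rf_demo.predict_proba(X_te)[:, 1])\n",
"print('Average precision : %.3f' % ap)"
]
},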
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10.\tRandom under-sampling and over-sampling with imbalanced-learn\n",
"\n",
"\n",
"\n",
"**Imbalanced-Learn** is a Python library that provides various algorithms for handling imbalanced datasets. It can be installed with `pip` (the package name is `imbalanced-learn`). The library also provides a `make_imbalance` function that exacerbates the level of class imbalance within a given dataset.\n",
"\n",
"\n",
"Now, I will demonstrate random under-sampling and over-sampling with Imbalanced-Learn.\n",
"\n",
"\n",
"First of all, I will import the library, which is exposed as the `imblearn` package.\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# import imbalanced learn library\n",
"\n",
"import imblearn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, I will import the `RandomUnderSampler` class. It is a quick and easy way to balance the data by randomly selecting a subset of data for the targeted classes. "
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# import RandomUnderSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"\n",
"# instantiate the RandomUnderSampler\n",
"rus = RandomUnderSampler()\n",
"\n",
"\n",
"# resample the dataset (fit_resample replaces the older fit_sample)\n",
"X_rus, y_rus = rus.fit_resample(X, y)\n",
"\n",
"# indices of the original rows that were kept (replaces return_indices=True)\n",
"id_rus = rus.sample_indices_\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retained indices: [ 5836 202223 199184 217162 156765 119722 9747 96096 46243 260470\n",
" 137149 9617 186704 172923 53824 264531 186954 148131 274313 50064\n",
" 87518 235078 91080 29077 141762 270962 270486 116652 263295 153042\n",
" 197270 105049 204953 53555 163533 115226 19172 128673 208375 107624\n",
" 20913 36306 90589 148390 18284 100420 211724 144616 279522 197900\n",
" 96253 13239 141860 114292 256405 119907 29937 24333 256844 226620\n",
" 41229 235689 75305 23532 13676 26830 132834 238893 131701 125698\n",
" 159945 196198 276122 10986 58949 7352 256147 168740 130487 98300\n",
" 161448 257972 117239 25459 123127 231101 77134 281171 50599 2473\n",
" 153956 31603 31999 283559 153970 251528 99568 129949 101736 190313\n",
" 118997 41125 31133 71329 160298 249008 126580 146635 230144 104093\n",
" 273563 267528 125426 39093 75909 117904 241892 56816 225206 33650\n",
" 5263 105116 211599 265515 21761 250814 29125 275722 44086 257401\n",
" 71780 132076 102377 214299 278445 268138 236507 95319 106287 273761\n",
" 142529 146819 135435 103856 141023 74960 181619 8311 168238 54159\n",
" 106336 229618 147865 200897 107247 255643 113321 66650 32975 20393\n",
" 180125 11994 256435 127328 41987 57728 90030 243368 41344 29823\n",
" 143290 38265 236063 217520 283773 150366 218249 122243 220355 189274\n",
" 27530 87580 221250 41056 107837 84258 50854 9063 189150 26381\n",
" 244868 55951 62812 207094 230748 192377 242018 178620 116709 188443\n",
" 208235 188563 44668 137090 197103 136151 175938 243614 267850 123543\n",
" 63129 164108 284346 218596 126940 259856 243974 207932 43625 257522\n",
" 196663 251069 173453 41363 257675 59752 212712 222609 52795 207421\n",
" 67880 47771 69700 115211 282664 246172 264648 241070 257206 75210\n",
" 238341 43663 16425 150562 63409 71794 213739 74121 128287 226199\n",
" 62370 8528 205549 112187 89638 71309 114325 84435 216889 228287\n",
" 130425 117058 280060 153016 205429 230355 20712 23043 122137 132379\n",
" 11661 176513 177958 94888 130708 36296 93731 14570 261935 251092\n",
" 252165 263671 162757 65270 277561 25114 100992 197227 44143 20917\n",
" 80714 278852 96638 90850 84163 6425 129699 136579 90116 106625\n",
" 188972 211079 101738 120161 40106 206659 258639 280200 126335 247343\n",
" 226440 130027 194130 278418 186462 82078 271763 35570 183145 271267\n",
" 279317 271526 252925 124545 190201 56605 18503 89200 166734 167539\n",
" 60858 226381 260764 82075 107619 10853 273426 191654 169236 178160\n",
" 95083 210584 63167 70200 226357 107241 160240 249633 253367 198873\n",
" 279417 37280 10344 209171 181792 135543 151665 133518 135882 212357\n",
" 280926 21195 218363 178371 215814 8014 32246 204065 22996 54619\n",
" 18406 151371 278843 251113 167432 175717 143750 253487 33354 17997\n",
" 167254 106361 20423 6917 18525 252277 96308 272383 281469 212870\n",
" 12459 18541 61863 257403 206573 167346 194616 165149 165896 59201\n",
" 10103 146561 3827 4515 140176 24129 89804 251580 111608 46719\n",
" 158238 7169 234934 121245 270536 135760 206508 26492 98444 248432\n",
" 81023 50882 277044 221110 177975 154267 257395 174818 149312 279392\n",
" 33443 205591 166024 106117 91340 95683 78512 184548 215361 129105\n",
" 74797 228909 182643 87382 1180 25699 247498 283012 192458 267131\n",
" 159556 65971 132476 133966 78890 67625 107435 48701 191604 158442\n",
" 117564 84450 129364 194944 100406 228483 76078 193637 284602 36073\n",
" 22438 57904 15759 83488 170064 191446 241572 65327 73440 96669\n",
" 240512 171764 206452 270464 251364 9144 182378 147063 128548 43986\n",
" 121838 210254 541 623 4920 6108 6329 6331 6334 6336\n",
" 6338 6427 6446 6472 6529 6609 6641 6717 6719 6734\n",
" 6774 6820 6870 6882 6899 6903 6971 8296 8312 8335\n",
" 8615 8617 8842 8845 8972 9035 9179 9252 9487 9509\n",
" 10204 10484 10497 10498 10568 10630 10690 10801 10891 10897\n",
" 11343 11710 11841 11880 12070 12108 12261 12369 14104 14170\n",
" 14197 14211 14338 15166 15204 15225 15451 15476 15506 15539\n",
" 15566 15736 15751 15781 15810 16415 16780 16863 17317 17366\n",
" 17407 17453 17480 18466 18472 18773 18809 20198 23308 23422\n",
" 26802 27362 27627 27738 27749 29687 30100 30314 30384 30398\n",
" 30442 30473 30496 31002 33276 39183 40085 40525 41395 41569\n",
" 41943 42007 42009 42473 42528 42549 42590 42609 42635 42674\n",
" 42696 42700 42741 42756 42769 42784 42856 42887 42936 42945\n",
" 42958 43061 43160 43204 43428 43624 43681 43773 44001 44091\n",
" 44223 44270 44556 45203 45732 46909 46918 46998 47802 48094\n",
" 50211 50537 52466 52521 52584 53591 53794 55401 56703 57248\n",
" 57470 57615 58422 58761 59539 61787 63421 63634 64329 64411\n",
" 64460 68067 68320 68522 68633 69498 69980 70141 70589 72757\n",
" 73784 73857 74496 74507 74794 75511 76555 76609 76929 77099\n",
" 77348 77387 77682 79525 79536 79835 79874 79883 80760 81186\n",
" 81609 82400 83053 83297 83417 84543 86155 87354 88258 88307\n",
" 88876 88897 89190 91671 92777 93424 93486 93788 94218 95534\n",
" 95597 96341 96789 96994 99506 100623 101509 102441 102442 102443\n",
" 102444 102445 102446 102782 105178 106679 106998 107067 107637 108258\n",
" 108708 111690 112840 114271 116139 116404 118308 119714 119781 120505\n",
" 120837 122479 123141 123201 123238 123270 123301 124036 124087 124115\n",
" 124176 125342 128479 131272 135718 137705 140786 141257 141258 141259\n",
" 141260 142405 142557 143188 143333 143334 143335 143336 143728 143731\n",
" 144104 144108 144754 145800 146790 147548 147605 149145 149357 149522\n",
" 149577 149587 149600 149869 149874 150601 150644 150647 150654 150660\n",
" 150661 150662 150663 150665 150666 150667 150668 150669 150677 150678\n",
" 150679 150680 150684 150687 150692 150697 150715 150925 151006 151007\n",
" 151008 151009 151011 151103 151196 151462 151519 151730 151807 152019\n",
" 152223 152295 153823 153835 153885 154234 154286 154371 154454 154587\n",
" 154633 154668 154670 154676 154684 154693 154694 154697 154718 154719\n",
" 154720 154960 156988 156990 157585 157868 157871 157918 163149 163586\n",
" 167184 167305 172787 176049 177195 178208 181966 182992 183106 184379\n",
" 189587 189701 189878 190368 191074 191267 191359 191544 191690 192382\n",
" 192529 192584 192687 195383 197586 198868 199896 201098 201601 203324\n",
" 203328 203700 204064 204079 204503 208651 212516 212644 213092 213116\n",
" 214662 214775 215132 215953 215984 218442 219025 219892 220725 221018\n",
" 221041 222133 222419 223366 223572 223578 223618 226814 226877 229712\n",
" 229730 230076 230476 231978 233258 234574 234632 234633 234705 235616\n",
" 235634 235644 237107 237426 238222 238366 238466 239499 239501 240222\n",
" 241254 241445 243393 243547 243699 243749 243848 244004 244333 245347\n",
" 245556 247673 247995 248296 248971 249167 249239 249607 249828 249963\n",
" 250761 251477 251866 251881 251891 251904 252124 252774 254344 254395\n",
" 255403 255556 258403 261056 261473 261925 262560 262826 263080 263274\n",
" 263324 263877 268375 272521 274382 274475 275992 276071 276864 279863\n",
" 280143 280149 281144 281674]\n"
]
}
],
"source": [
"# print the indices of the rows kept by the under-sampler\n",
"print(\"Retained indices: \", id_rus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
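"The indices above identify the rows that were kept in the under-sampled dataset (all minority-class rows plus an equal number of randomly chosen majority-class rows), not the rows that were removed. Every other row of the original dataset was dropped."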
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will demonstrate random oversampling. The process will be the same as random undersampling."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from imblearn.over_sampling import RandomOverSampler\n",
"\n",
"ros = RandomOverSampler()\n",
"\n",
"# duplicate randomly chosen minority-class rows until the classes are balanced\n",
"X_ros, y_ros = ros.fit_resample(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"283823 minority-class rows added by duplication\n"
]
}
],
"source": [
"print(X_ros.shape[0] - X.shape[0], 'minority-class rows added by duplication')"
]
},
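{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to check what the two samplers do is to compare the class counts before and after resampling. The following is a minimal sketch on synthetic data, assuming a version of Imbalanced-Learn that provides `fit_resample`; `X_demo` and `y_demo` are illustrative names and do not refer to the dataset used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch only: synthetic data with roughly 5% minority samples\n",
"from collections import Counter\n",
"\n",
"from imblearn.over_sampling import RandomOverSampler\n",
"from imblearn.under_sampling import RandomUnderSampler\n",
"from sklearn.datasets import make_classification\n",
"\n",
"X_demo, y_demo = make_classification(n_samples=2000, weights=[0.95, 0.05],\n",
"                                     random_state=0)\n",
"print('Original      :', Counter(y_demo))\n",
"\n",
"# under-sampling shrinks the majority class to the size of the minority class\n",
"X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X_demo, y_demo)\n",
"print('Under-sampled :', Counter(y_u))\n",
"\n",
"# over-sampling grows the minority class to the size of the majority class\n",
"X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X_demo, y_demo)\n",
"print('Over-sampled  :', Counter(y_o))"
]
},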
{
"attachments": {
"Tomek%20links.jpg": {
"image/jpeg": ""
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11.\tUnder-sampling : Tomek links\n",
"\n",
"\n",
"Tomek links are pairs of observations from different classes that are each other's nearest neighbours.\n",
"\n",
"\n",
"The figure below illustrates the concept of Tomek links:\n",
"\n",
"\n",
"\n",
"![Tomek%20links.jpg](attachment:Tomek%20links.jpg)\n",
"\n",
"\n",
"\n",
"We can see in the above image that the Tomek links (circled in green) are the pairs of red and blue data points that are nearest neighbours of each other. Such pairs lie on or near the class boundary, or are noise, and many classification algorithms struggle with them. So, I will remove these points to increase the separation between the two classes.\n",
"\n",
"This technique will not produce a balanced dataset. It simply cleans the dataset by removing the Tomek links, which may result in an easier classification problem. Thus, by removing the Tomek links, we can improve the performance of the classifier even if we don’t have a balanced dataset.\n",
"\n",
"\n",
"In the following code, I configure `TomekLinks` to resample only the majority class, so that only the majority-class member of each Tomek link is removed.\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"from imblearn.under_sampling import TomekLinks\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"tl = TomekLinks(sampling_strategy='majority')\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"X_tl, y_tl = tl.fit_resample(X, y)\n",
"\n",
"# indices of the original rows that were kept\n",
"id_tl = tl.sample_indices_\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retained indexes: [ 0 1 2 ... 284804 284805 284806]\n"
]
}
],
"source": [
"# these are the indices of the rows that were kept, not the rows that were removed\n",
"print('Retained indexes:', id_tl)"
]
},
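{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the definition of a Tomek link concrete, here is a minimal sketch that finds such pairs with scikit-learn's `NearestNeighbors`. The helper `find_tomek_links` and the toy arrays are hypothetical names of my own and are not part of Imbalanced-Learn. The sketch assumes there are no duplicate points, so that each point is returned as its own closest neighbour."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"\n",
"def find_tomek_links(X, y):\n",
"    # column 0 is the point itself, column 1 is its nearest other point\n",
"    nn = NearestNeighbors(n_neighbors=2).fit(X)\n",
"    nearest = nn.kneighbors(X, return_distance=False)[:, 1]\n",
"    links = []\n",
"    for i, j in enumerate(nearest):\n",
"        # mutual nearest neighbours with different labels; i < j avoids duplicates\n",
"        if i < j and nearest[j] == i and y[i] != y[j]:\n",
"            links.append((i, int(j)))\n",
"    return links\n",
"\n",
"\n",
"# toy one-dimensional data: points 1 and 2 are close together but differ in class\n",
"X_toy = np.array([[0.0], [1.0], [1.1], [3.0], [3.2]])\n",
"y_toy = np.array([0, 0, 1, 1, 1])\n",
"\n",
"print(find_tomek_links(X_toy, y_toy))  # prints [(1, 2)]"
]
},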
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Under-sampling : Cluster Centroids\n",
"\n",
"\n",
"In this technique, we perform under-sampling by generating centroids based on clustering methods. The majority-class samples are grouped by similarity, and each cluster is replaced by its centroid in order to preserve information.\n",
"\n",
"In this example, I pass the dict `{0: 10}` as the sampling ratio. The majority class (0) is reduced to 10 cluster centroids, while all minority-class (1) samples are kept."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"from imblearn.under_sampling import ClusterCentroids"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"cc = ClusterCentroids(sampling_strategy={0: 10})"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"X_cc, y_cc = cc.fit_resample(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"284305 fewer points after under-sampling with Cluster Centroids\n"
]
}
],
"source": [
"print(X.shape[0] - X_cc.shape[0], 'fewer points after under-sampling with Cluster Centroids')"
]
},
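{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind this technique can be sketched by hand with k-means: cluster the majority class and keep only the cluster centres. This is a simplified illustration on synthetic data and is not necessarily how `ClusterCentroids` is implemented internally; the names `X_demo`, `X_small` and `y_small` are illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.datasets import make_classification\n",
"\n",
"# sketch only: synthetic data with roughly 10% minority samples\n",
"X_demo, y_demo = make_classification(n_samples=1000, weights=[0.9, 0.1],\n",
"                                     random_state=0)\n",
"X_maj, X_min = X_demo[y_demo == 0], X_demo[y_demo == 1]\n",
"\n",
"# summarise the majority class by the centroids of 10 k-means clusters\n",
"km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_maj)\n",
"\n",
"# the centroids stand in for the majority class; minority samples are unchanged\n",
"X_small = np.vstack([km.cluster_centers_, X_min])\n",
"y_small = np.concatenate([np.zeros(10, dtype=int),\n",
"                          np.ones(len(X_min), dtype=int)])\n",
"\n",
"# shape of the reduced dataset and the number of samples per class\n",
"print(X_small.shape, np.bincount(y_small))"
]
},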
{
"attachments": {
"smote.png": {
"image/png": ""
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13.\tOver-sampling : SMOTE\n",
"\n",
"\n",
"\n",
"In the context of synthetic data generation, there is a powerful and widely used method known as the **synthetic minority over-sampling technique** or **SMOTE**. Under this technique, artificial minority-class samples are created in feature space with the help of the k-nearest neighbours algorithm. It works as follows:\n",
"\n",
"\n",
"1.\tFirst of all, we take the difference between the feature vector (sample) under consideration and one of its k nearest minority-class neighbours, chosen at random.\n",
"\n",
"\n",
"2.\tThen we multiply this difference by a random number between 0 and 1.\n",
"\n",
"\n",
"3.\tThen we add this scaled difference to the feature vector under consideration.\n",
"\n",
"\n",
"4.\tThus we select a random point along the line segment between the two feature vectors.\n",
"\n",
"\n",
"The concept of **SMOTE** can best be illustrated with the following figure:\n",
"\n",
"\n",
"![smote.png](attachment:smote.png)\n",
"\n",
"\n",
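"So, **SMOTE** generates new observations by interpolating between existing minority-class observations: $x_{new} = x_i + \\lambda (x_{nn} - x_i)$, where $x_{nn}$ is the chosen neighbour of $x_i$ and $\\lambda$ is the random number between 0 and 1.\n",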
"\n"
]
},
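{
"cell_type": "markdown",
"metadata": {},
"source": [
"The interpolation steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Imbalanced-Learn implementation; the function `smote_sketch`, its parameters and the toy data are hypothetical, and the sketch assumes the minority class has more than `k` distinct samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"\n",
"def smote_sketch(X_min, n_new, k=5, seed=0):\n",
"    rng = np.random.default_rng(seed)\n",
"    # ask for k + 1 neighbours because each point is its own closest neighbour\n",
"    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)\n",
"    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]\n",
"    # samples under consideration, drawn at random with replacement\n",
"    base = rng.integers(0, len(X_min), size=n_new)\n",
"    # one randomly chosen neighbour for each of them\n",
"    picked = neigh[base, rng.integers(0, k, size=n_new)]\n",
"    # random number in [0, 1) that scales the difference\n",
"    gap = rng.random((n_new, 1))\n",
"    # point on the segment between each sample and its chosen neighbour\n",
"    return X_min[base] + gap * (X_min[picked] - X_min[base])\n",
"\n",
"\n",
"# toy minority class: 20 points in two dimensions\n",
"X_min_toy = np.random.default_rng(1).normal(size=(20, 2))\n",
"X_new = smote_sketch(X_min_toy, n_new=5)\n",
"print(X_new.shape)  # prints (5, 2)"
]
},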
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"from imblearn.over_sampling import SMOTE"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"smote = SMOTE(sampling_strategy='minority')"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"X_sm, y_sm = smote.fit_resample(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"283823 New points created under SMOTE\n"
]
}
],
"source": [
"print(X_sm.shape[0] - X.shape[0], 'New points created under SMOTE')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Conclusion\n",
"\n",
"\n",
"In this Jupyter notebook, I have discussed various approaches to deal with the problem of imbalanced classes. These are random over-sampling, random under-sampling, tree-based algorithms, resampling with the Imbalanced-Learn library, under-sampling with Tomek links, under-sampling with Cluster Centroids and over-sampling with SMOTE.\n",
"\n",
"\n",
"Some combination of these approaches will help us to create a better classifier. Simple sampling techniques may handle slight imbalance, whereas more advanced methods such as ensemble methods may be required for extreme imbalance.\n",
"\n",
"\n",
"So, based on the above discussion, we can conclude that there is no single solution to the imbalanced classes problem. We should try out multiple methods and select the technique best suited to the dataset at hand. The most effective technique will vary according to the characteristics of the dataset.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}