{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forest Classification with Python and Scikit-Learn\n",
"\n",
"\n",
"Random Forest is a supervised machine learning algorithm based on ensemble learning. In this project, I build two Random Forest Classifier models to predict the safety of the car: one with 10 decision-trees and another with 100 decision-trees. Accuracy is expected to increase with the number of decision-trees in the model. I also demonstrate the **feature selection process**: using the Random Forest model to find only the important features, rebuilding the model using these features, and observing the effect on accuracy. I have used the **Car Evaluation Data Set** for this project, downloaded from the UCI Machine Learning Repository website.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction to Random Forest algorithm\n",
"2.\tRandom Forest algorithm intuition\n",
"3.\tAdvantages and disadvantages of Random Forest algorithm\n",
"4.\tFeature selection with Random Forests\n",
"5.\tThe problem statement\n",
"6.\tDataset description\n",
"7.\tImport libraries\n",
"8.\tImport dataset\n",
"9.\tExploratory data analysis\n",
"10.\tDeclare feature vector and target variable\n",
"11.\tSplit data into separate training and test set\n",
"12.\tFeature engineering\n",
"13.\tRandom Forest Classifier model with default parameters\n",
"14.\tRandom Forest Classifier model with parameter n_estimators=100\n",
"15.\tFind important features with Random Forest model\n",
"16.\tVisualize the feature scores of the features\n",
"17.\tBuild the Random Forest model on selected features\n",
"18.\tConfusion matrix\n",
"19.\tClassification report\n",
"20.\tResults and conclusion\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to Random Forest algorithm\n",
"\n",
"\n",
"\n",
"Random forest is a supervised learning algorithm. It has two variations: one is used for classification problems and the other for regression problems. It is one of the most flexible and easy-to-use algorithms. It creates decision-trees on the given data samples, gets a prediction from each tree and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance.\n",
"\n",
"\n",
"The random forest algorithm combines multiple decision-trees, resulting in a forest of trees, hence the name `Random Forest`. In the random forest classifier, a higher number of trees in the forest generally results in higher accuracy.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Random Forest algorithm intuition\n",
"\n",
"\n",
"Random forest algorithm intuition can be divided into two stages. \n",
"\n",
"\n",
"In the first stage, we randomly select `k` features out of the total `m` features and build the trees of the random forest. We proceed as follows:-\n",
"\n",
"1.\tRandomly select `k` features from a total of `m` features where `k < m`.\n",
"2.\tAmong the `k` features, calculate the node `d` using the best split point.\n",
"3.\tSplit the node into daughter nodes using the best split.\n",
"4.\tRepeat steps 1 to 3 until `l` nodes have been reached.\n",
"5.\tBuild the forest by repeating steps 1 to 4 `n` times to create `n` trees.\n",
"\n",
"\n",
"In the second stage, we make predictions using the trained random forest algorithm. \n",
"\n",
"1.\tWe take the test features and use the rules of each randomly created decision-tree to predict the outcome, and store the predicted outcome.\n",
"2.\tThen, we calculate the votes for each predicted target.\n",
"3.\tFinally, we take the majority-voted predicted target as the final prediction from the random forest algorithm.\n"
]
},
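{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two-stage procedure above can be sketched in a few lines of scikit-learn (a minimal illustration on toy data, not the code used later in this notebook; the toy dataset and the tree count are assumptions made for the example):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# toy data: two features, binary target (assumed for illustration)\n",
"rng = np.random.RandomState(0)\n",
"X_toy = rng.rand(200, 2)\n",
"y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)\n",
"\n",
"# stage 1: each tree is grown on a bootstrap sample,\n",
"# considering a random subset of features at every split\n",
"forest = RandomForestClassifier(n_estimators=5, max_features='sqrt', random_state=0)\n",
"forest.fit(X_toy, y_toy)\n",
"\n",
"# stage 2: each tree votes on a test point and the\n",
"# majority-voted class becomes the forest's prediction\n",
"votes = [int(tree.predict(X_toy[:1])[0]) for tree in forest.estimators_]\n",
"print('per-tree votes:', votes)\n",
"print('forest prediction:', forest.predict(X_toy[:1])[0])"
]
},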
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advantages and disadvantages of Random Forest algorithm\n",
"\n",
"\n",
"The advantages of Random forest algorithm are as follows:-\n",
"\n",
"\n",
"1.\tRandom forest algorithm can be used to solve both classification and regression problems.\n",
"2.\tIt is considered a very accurate and robust model because it uses a large number of decision-trees to make predictions.\n",
"3.\tRandom forests average the predictions made by the decision-trees, which cancels out individual biases. As a result, the model is much less prone to overfitting.\n",
"4.\tRandom forest classifier can handle missing values. There are two ways to do so: the first is to replace missing continuous values with median values, and the second is to compute a proximity-weighted average of the missing values.\n",
"5.\tRandom forest classifier can be used for feature selection, i.e. selecting the most important features out of the available features in the training dataset.\n",
"\n",
"\n",
"The disadvantages of Random Forest algorithm are listed below:-\n",
"\n",
"\n",
"1.\tThe biggest disadvantage of random forests is their computational cost. Random forests are slow in making predictions because a large number of decision-trees is used: every tree in the forest has to make a prediction for the same input, and the results are then combined by voting. So, it is a time-consuming process.\n",
"2.\tThe model is difficult to interpret compared to a single decision-tree, where a prediction can be explained by simply following the path down the tree.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Feature selection with Random Forests\n",
"\n",
"\n",
"\n",
"The random forest algorithm can also be used for feature selection. It can rank the importance of the variables in a regression or classification problem. \n",
"\n",
"\n",
"We measure the variable importance in a dataset by fitting the random forest algorithm to the data. During the fitting process, the out-of-bag error for each data point is recorded and averaged over the forest. \n",
"\n",
"\n",
"The importance of the j-th feature is measured after training: the values of the j-th feature are permuted among the training data and the out-of-bag error is computed again on this perturbed dataset. The importance score for the j-th feature is the difference in out-of-bag error before and after the permutation, averaged over all trees and normalized by the standard deviation of these differences.\n",
"\n",
"\n",
"Features which produce large values for this score are ranked as more important than features which produce small values. Based on this score, we will choose the most important features and drop the least important ones for model building. \n"
]
},
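{
"cell_type": "markdown",
"metadata": {},
"source": [
"The permutation-importance idea described above can be sketched as follows (a minimal illustration on synthetic data; note that `sklearn.inspection.permutation_importance`, available from scikit-learn 0.22, permutes features on the data you pass in rather than on out-of-bag samples, so it is a close analogue of the procedure rather than the exact out-of-bag version):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.inspection import permutation_importance\n",
"\n",
"# synthetic data assumed purely for illustration\n",
"X_syn, y_syn = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=0)\n",
"rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)\n",
"\n",
"# permute each feature in turn and record the drop in accuracy;\n",
"# a larger drop means a more important feature\n",
"result = permutation_importance(rf, X_syn, y_syn, n_repeats=10, random_state=0)\n",
"for i, score in enumerate(result.importances_mean):\n",
"    print('feature', i, ':', round(score, 3))"
]
},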
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. The problem statement\n",
"\n",
"\n",
"The problem is to predict the safety of the car. In this project, I build Random Forest Classifier models to predict the safety of the car. I implement Random Forest Classification with Python and Scikit-Learn. I have used the **Car Evaluation Data Set** for this project, downloaded from the UCI Machine Learning Repository website.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Dataset description\n",
"\n",
"\n",
"I have used the **Car Evaluation Data Set** downloaded from the UCI Machine Learning Repository website. The data set can be found at the following url:-\n",
"\n",
"\n",
"http://archive.ics.uci.edu/ml/datasets/Car+Evaluation\n",
"\n",
"\n",
"The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for an expert system for decision making. It contains examples with the structural information removed, i.e., it directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety. \n",
"\n",
"It was donated by Marko Bohanec."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/car.data'\n",
"\n",
"df = pd.read_csv(data, header=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1728, 7)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 1728 instances and 7 variables in the data set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View top 5 rows of dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6\n",
"0 vhigh vhigh 2 2 small low unacc\n",
"1 vhigh vhigh 2 2 small med unacc\n",
"2 vhigh vhigh 2 2 small high unacc\n",
"3 vhigh vhigh 2 2 med low unacc\n",
"4 vhigh vhigh 2 2 med med unacc"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rename column names\n",
"\n",
"We can see that the dataset does not have proper column names. The columns are merely labelled 0, 1, 2 and so on. We should give the columns proper names. I will do it as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']\n",
"\n",
"\n",
"df.columns = col_names\n",
"\n",
"col_names"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety class\n",
"0 vhigh vhigh 2 2 small low unacc\n",
"1 vhigh vhigh 2 2 small med unacc\n",
"2 vhigh vhigh 2 2 small high unacc\n",
"3 vhigh vhigh 2 2 med low unacc\n",
"4 vhigh vhigh 2 2 med med unacc"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's again preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the column names are renamed. Now, the columns have meaningful names."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary of dataset"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1728 entries, 0 to 1727\n",
"Data columns (total 7 columns):\n",
"buying 1728 non-null object\n",
"maint 1728 non-null object\n",
"doors 1728 non-null object\n",
"persons 1728 non-null object\n",
"lug_boot 1728 non-null object\n",
"safety 1728 non-null object\n",
"class 1728 non-null object\n",
"dtypes: object(7)\n",
"memory usage: 94.6+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Frequency distribution of values in variables\n",
"\n",
"Now, I will check the frequency counts of categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"high 432\n",
"med 432\n",
"low 432\n",
"vhigh 432\n",
"Name: buying, dtype: int64\n",
"high 432\n",
"med 432\n",
"low 432\n",
"vhigh 432\n",
"Name: maint, dtype: int64\n",
"5more 432\n",
"4 432\n",
"2 432\n",
"3 432\n",
"Name: doors, dtype: int64\n",
"4 576\n",
"2 576\n",
"more 576\n",
"Name: persons, dtype: int64\n",
"small 576\n",
"big 576\n",
"med 576\n",
"Name: lug_boot, dtype: int64\n",
"high 576\n",
"med 576\n",
"low 576\n",
"Name: safety, dtype: int64\n",
"unacc 1210\n",
"acc 384\n",
"good 69\n",
"vgood 65\n",
"Name: class, dtype: int64\n"
]
}
],
"source": [
"col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']\n",
"\n",
"\n",
"for col in col_names:\n",
" \n",
" print(df[col].value_counts()) \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that `doors` and `persons` contain values such as `5more` and `more`, so although they look numeric they are categorical in nature. I will treat them as categorical variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of variables\n",
"\n",
"\n",
"- There are 7 variables in the dataset. All the variables are of categorical data type.\n",
"\n",
"\n",
"- These are given by `buying`, `maint`, `doors`, `persons`, `lug_boot`, `safety` and `class`.\n",
"\n",
"\n",
"- `class` is the target variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `class` variable"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"unacc 1210\n",
"acc 384\n",
"good 69\n",
"vgood 65\n",
"Name: class, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `class` target variable is ordinal in nature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Missing values in variables"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"buying 0\n",
"maint 0\n",
"doors 0\n",
"persons 0\n",
"lug_boot 0\n",
"safety 0\n",
"class 0\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in variables\n",
"\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in the dataset. I have checked the frequency distribution of values previously. It also confirms that there are no missing values in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(['class'], axis=1)\n",
"\n",
"y = df['class']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# split data into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1157, 6), (571, 6))"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of X_train and X_test\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Feature Engineering\n",
"\n",
"\n",
"**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.\n",
"\n",
"\n",
"First, I will check the data types of variables again."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"buying object\n",
"maint object\n",
"doors object\n",
"persons object\n",
"lug_boot object\n",
"safety object\n",
"dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check data types in X_train\n",
"\n",
"X_train.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode categorical variables\n",
"\n",
"\n",
"Now, I will encode the categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>3</td>\n",
" <td>more</td>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>468</th>\n",
" <td>high</td>\n",
" <td>vhigh</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>small</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>vhigh</td>\n",
" <td>high</td>\n",
" <td>3</td>\n",
" <td>more</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1721</th>\n",
" <td>low</td>\n",
" <td>low</td>\n",
" <td>5more</td>\n",
" <td>more</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1208</th>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" <td>2</td>\n",
" <td>more</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety\n",
"48 vhigh vhigh 3 more med low\n",
"468 high vhigh 3 4 small low\n",
"155 vhigh high 3 more small high\n",
"1721 low low 5more more small high\n",
"1208 med low 2 more small high"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that all the variables are of ordinal categorical data type."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# import category encoders\n",
"\n",
"import category_encoders as ce"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# encode categorical variables with ordinal encoding\n",
"\n",
"encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])\n",
"\n",
"\n",
"X_train = encoder.fit_transform(X_train)\n",
"\n",
"X_test = encoder.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>468</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1721</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1208</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety\n",
"48 1 1 1 1 1 1\n",
"468 2 1 1 2 2 1\n",
"155 1 2 1 1 2 2\n",
"1721 3 3 2 1 2 2\n",
"1208 4 3 3 1 2 2"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>599</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1201</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>628</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1498</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1263</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety\n",
"599 2 2 4 3 1 2\n",
"1201 4 3 3 2 1 3\n",
"628 2 2 2 3 3 3\n",
"1498 3 2 2 2 1 3\n",
"1263 4 3 4 1 1 1"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have the training and test sets ready for model building. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Random Forest Classifier model with default parameters"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with 10 decision-trees : 0.9247\n"
]
}
],
"source": [
"# import Random Forest classifier\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"\n",
"\n",
"# instantiate the classifier \n",
"\n",
"rfc = RandomForestClassifier(random_state=0)\n",
"\n",
"\n",
"\n",
"# fit the model\n",
"\n",
"rfc.fit(X_train, y_train)\n",
"\n",
"\n",
"\n",
"# Predict the Test set results\n",
"\n",
"y_pred = rfc.predict(X_test)\n",
"\n",
"\n",
"\n",
"# Check accuracy score \n",
"\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"print('Model accuracy score with 10 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, **y_test** are the true class labels and **y_pred** are the predicted class labels in the test-set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, I have built the Random Forest Classifier model with the default parameter of `n_estimators = 10` (the default in the scikit-learn version used here), so the model uses 10 decision-trees. Now, I will increase the number of decision-trees and see its effect on accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Random Forest Classifier model with parameter n_estimators=100"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with 100 decision-trees : 0.9457\n"
]
}
],
"source": [
"# instantiate the classifier with n_estimators = 100\n",
"\n",
"rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"\n",
"\n",
"\n",
"# fit the model to the training set\n",
"\n",
"rfc_100.fit(X_train, y_train)\n",
"\n",
"\n",
"\n",
"# Predict on the test set results\n",
"\n",
"y_pred_100 = rfc_100.predict(X_test)\n",
"\n",
"\n",
"\n",
"# Check accuracy score \n",
"\n",
"print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model accuracy score with 10 decision-trees is 0.9247, while with 100 decision-trees it is 0.9457. So, as expected, accuracy increases with the number of decision-trees in the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Find important features with Random Forest model\n",
"\n",
"\n",
"Until now, I have used all the features given in the model. Now, I will select only the important features, build the model using these features and see its effect on accuracy. \n",
"\n",
"\n",
"First, I will create the Random Forest model as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,\n",
" oob_score=False, random_state=0, verbose=0, warm_start=False)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create the classifier with n_estimators = 100\n",
"\n",
"clf = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"\n",
"\n",
"\n",
"# fit the model to the training set\n",
"\n",
"clf.fit(X_train, y_train)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will use the `feature_importances_` attribute of the fitted model to see the feature importance scores."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"safety 0.295319\n",
"persons 0.233856\n",
"buying 0.151734\n",
"maint 0.146653\n",
"lug_boot 0.100048\n",
"doors 0.072389\n",
"dtype: float64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the feature scores\n",
"\n",
"feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)\n",
"\n",
"feature_scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the most important feature is `safety` and the least important feature is `doors`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Visualize the feature scores of the features\n",
"\n",
"\n",
"Now, I will visualize the feature scores with matplotlib and seaborn."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAaAAAAEWCAYAAAAgpUMxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3XucXdP9//HXO6EikiZC3EUkgqJERVvfupZf0V+VtpRKVcqPurS032qrpaqKUr2qloYSSisupaVupQRRl4RcqGjIBaUqLkmEhsTn98deIzvjzMyZmXNmnZm8n4/HeWSfvdde+7P2mZzPWWvfFBGYmZl1tV65AzAzsxWTE5CZmWXhBGRmZlk4AZmZWRZOQGZmloUTkJmZZeEEZF1G0gWSvlvnbdwl6f+l6dGSbqtinZslHVrPuMzs3ZyArCYk3SrptArz95X0b0krRcRREfGDroopIq6IiI9VUW7viLi01tuXtKukZ2tdb0dIGiopJK1Uo/rabJukcZLelPRa6XVgDbYdkjbpbD2WnxOQ1co44BBJajb/EOCKiFjS9SEZQK2STgf9KCL6lV7jM8YCgKTeuWOwghOQ1cr1wCBgp6YZklYHPgFclt6Pk3R6ml5T0o2SXpX0sqR7JPVKy5b7hdtsvdXTei9KeiVNb1ApIEljJN2bpr/Z7Jf4W5LGpWXlYbsxku6V9ONU/2xJe5fq3FjS3ZIWSrpd0q8kXV7NDkrbOV3SfSmGGyStIekKSQskPSRpaKl8SDpO0ixJ8ySdU9pHvSSdLGmupP9IukzSgLSsqbdzuKSngb8Bd6dqX03b3kHScEl/k/RSqv8KSQNL258j6QRJ0yTNlzReUh9JqwE3A+uV9ud61eyDUt3rSbo2fY6zJR1XWvZBSX9PfxvPSzpP0nvSsqZ2TG3qUZU/52b7bpM0PU7S+ZJukrQI2E3SKukzflrSCyqGh1dN5Vv827Ta8k61moiIN4CrgC+UZn8WmBERUyus8nXgWWAwsDbwHaCa+0L1Ai4BNgKGAG8A51UR3zu/xIH3AS+meCv5EPAEsCbwI+C3pZ7d74EHgTWAUyl6eO1xUFpnfWA48PfUnkHA48D3mpX/FDAK+ACwL3BYmj8mvXYDhgH9ePd+2IWirXsCO6d5A9N++Dsg4IfAeqnchqlNZZ8F9gI2BrYGxkTEImBv4LlSz+a5andA+jK/AZia9sPuwFcl7ZmKLAW+RrH/d0jLjwGIiKZ2bNPOHtXBwBlAf+Be4GxgU2AksEmK45RUtqN/m9ZOTkBWS5cCBzT9kqRIRi0dW3kLWBfYKCLeioh7ooobE0bESxFxbUS8HhELKb5Udqk2wBTb9cAvIuKmForNjYgLI2Jpin9dYG1JQ4DtgVMi4s2IuBf4c7XbTi6JiKciYj5FL+KpiLg9DVFeDWzbrPzZEfFyRDwN/Bz4XJo/GvhpRMyKiNeAbwMHafnhtlMjYlH6cfAuEfFkRPw1IhZHxIvAT3n3vjw3Ip6LiJcpksbIdrb3hNSTeFXSvDRve2BwRJyW9uMs4EKK5ExETI6I+yNiSUTMAX5TIa72+lNETIyIt4HFwBHA19K+XQic2bR9Ovi3ae3nBGQ1k76QXwT2lTSM4ovm9y0UPwd4ErgtDTGdWM02JPWV9Js09LSAYmhpoKof1/8t8EREnN1KmX83TUTE62myH0VP4eXSPIBnqtxukxdK029UeN+vWfly/XNTDKR/5zZbthLFL/aqYpO0lqQrJf0r7cvLKXodZf8uTb9eIb62/DgiBqZXU90bUQzfNSWmVyl6GWunuDZNQ2D/TnGdWSGu9irvi8FAX2Byafu3pPnQwb9Naz8nIKu1yyh6PocAt0XEC5UKRcTCiPh6RAwD9gH+V9LuafHrFF8QTdYpTX8d2Az4UES8l2VDS81PfniX9EWyGXB4O9pT9jwwSFI5tg07WFe1yvUPAZqGup6j+CIvL1vC8g
ktWphu8sM0f+u0Lz9PFfuxlfqq9Qwwu5SYBkZE/4j4eFp+PjADGJHi+k4bcS2i9PciaZ0KZcrxzqNI9luWtj8gDc+29bdpNeQEZLV2GbAHxRBHi6c2S/qEpE3SsZUFFOP+S9PiKcDBknpL2ovlh1/6U3x5vCppEO8+ZtLS9vYGjgP2a2lIqi0RMReYBJwq6T2SdqD4gqqnb6g48WJD4Hig6ZjHH4CvqTgpoh9FL2F8K2cbvgi8TXG8qEl/4DWKfbk+8I12xPUCsEbTiQ/t9CCwQNK3JK2aPuetJG1fimsB8JqkzYGjK2y73I6pwJaSRkrqw7uPYy0nDcNdCPxM0loAktZvOgbVxt+m1ZATkNVUGrO/D1iN1o+PjABup/gC/Dvw64i4Ky07nuKL/VWKYx3Xl9b7ObAqxa/Y+ymGTqpxIMUQy+OlM7cuqHLdstEUB8ZfAk6nSAiLO1BPtf4ETKZIyn+hGEIEuBj4HcUQ5Gzgv8BXWqokDRueAUxMw04fBr5PcXLD/FT3H6sNKiJmUCTBWam+qs+CS8fW9qE4njSb4rO8CGhKZidQnDSwkCJRND/R4FTg0rTdz0bEP4HTKP6eZlKcZNCWb1EMs92fhvlup+gdQ+t/m1ZD8rE1s46TNJ7iTL+qemLtrDsohqGerHXdZo3APSCzdpC0vYrrZ3ql4cF9Wb6HZmZVynmFtFl3tA7FUNUaFNeKHB0Rj+QNyax78hCcmZll4SE4MzPLwkNwrVhzzTVj6NChucMwM+tWJk+ePC8iBrdVzgmoFUOHDmXSpEm5wzAz61YkzW27lIfgzMwsEycgMzPLwkNwrXj82ZfY7huX5Q7DzKxLTT7nC20XqgH3gMzMLAsnIDMzy8IJyMzMsnACMjOzLJyAzMwsCycgMzPLwgnIzMyycAIyM7MsnIDMzCwLJyAzM8uiWyYgSZtLmiLpEUnDWyn3na6My8zMqtctExCwH/CniNg2Ip5qpZwTkJlZg2qYm5FKWg24CtgA6A38ANgM2AdYFbgP+BKwN/BVYKmknSNiN0mfB44D3gM8ABwDnAGsKmkK8BgwC5gXEb9I2zsDeCEizu26VpqZWZNG6gHtBTwXEdtExFbALcB5EbF9er8q8ImIuAm4APhZSj7vAw4EPhIRI4GlwOiIOBF4IyJGRsRo4LfAoQCSegEHAVc0D0LSkZImSZq05PWF9W+1mdkKqpES0HRgD0lnS9opIuYDu0l6QNJ04KPAlhXW2x3YDngo9XZ2B4Y1LxQRc4CXJG0LfAx4JCJeqlBubESMiohRK/XtX7PGmZnZ8hpmCC4i/ilpO+DjwA8l3QYcC4yKiGcknQr0qbCqgEsj4ttVbOYiYAywDnBxTQI3M7MOaZgekKT1gNcj4nLgx8AH0qJ5kvoB+7ew6h3A/pLWSvUMkrRRWvaWpJVLZa+jGOrbHri11m0wM7PqNUwPCHg/cI6kt4G3gKMpznabDswBHqq0UkT8Q9LJwG3p2M5bFD2nucBYYJqkhyNidES8KelO4NWIWFr3FpmZWYsUEblj6DIpQT0MHBARM9sqv9o6G8fmh3y//oGZmTWQzj6SW9LkiBjVVrmGGYKrN0lbAE8Cd1STfMzMrL4aaQiuriLiH1Q4O87MzPJYYXpAZmbWWJyAzMwsCycgMzPLwgnIzMyycAIyM7MsnIDMzCwLJyAzM8tihbkOqCPet8EaTOrkFcFmZlaZe0BmZpaFE5CZmWXhBGRmZlk4AZmZWRZOQGZmloUTkJmZZeHTsFvx5vOP8fRp788dhpl1gSGnTM8dwgrHPSAzM8vCCcjMzLJwAjIzsyycgMzMLAsnIDMzy8IJyMzMsnACMjOzLJyAzMwsCycgMzPLwgnIzMyycAIyM7MsnIDMzCyLhk9AknzDVDOzHqhLEpCkoZJmSLpU0jRJ10jqK2k7SRMkTZZ0q6R1U/m7JJ0paQJwvKQDJD0qaaqku1OZPpIukTRd0iOSdkvzx0j6o6RbJM2U9KM0v7
ekcame6ZK+1hVtNzOzyrqyd7EZcHhETJR0MXAs8Clg34h4UdKBwBnAYan8wIjYBUDSdGDPiPiXpIFp+bEAEfF+SZsDt0naNC0bCWwLLAaekPRLYC1g/YjYKtXZVM9yJB0JHAmw/oCVa9h8MzMr68ohuGciYmKavhzYE9gK+KukKcDJwAal8uNL0xOBcZKOAHqneTsCvwOIiBnAXKApAd0REfMj4r/AP4CNgFnAMEm/lLQXsKBSkBExNiJGRcSoQav1rlTEzMxqoCt7QNHs/ULgsYjYoYXyi95ZMeIoSR8C/i8wRdJIQK1sa3FpeimwUkS8ImkbisR3LPBZlvW2zMysi3VlD2iIpKZk8zngfmBw0zxJK0vastKKkoZHxAMRcQowD9gQuBsYnZZvCgwBnmhp45LWBHpFxLXAd4EP1KZZZmbWEV3ZA3ocOFTSb4CZwC+BW4FzJQ1IsfwceKzCuudIGkHR67kDmArMAC5Ix4eWAGMiYrHUYsdofeASSU1J99u1aZaZmXWEIpqPjNVhI9JQ4MamEwC6i63XXzVu/NImucMwsy4w5JTpuUPoMSRNjohRbZVr+OuAzMysZ+qSIbiImENxxpuZmRngHpCZmWXiBGRmZlk4AZmZWRZOQGZmloUTkJmZZeEEZGZmWfhZO614z7pbMuSUSbnDMDPrkdwDMjOzLJyAzMwsCycgMzPLwgnIzMyycAIyM7MsnIDMzCwLn4bdihn/mcFHfvmR3GGYATDxKxNzh2BWU+4BmZlZFk5AZmaWhROQmZll4QRkZmZZOAGZmVkWTkBmZpaFE5CZmWXhBGRmZlk4AZmZWRZOQGZmloUTkJmZZZE1AUkaKunRGtRzlKQv1CImMzPrGj3iZqQRcUHuGMzMrH3a3QOStLqkrWsYw0qSLpU0TdI1kvpKmiNpzbS9UZLuktRL0kxJg9P8XpKelLSmpFMlnZDm3yXpbEkPSvqnpJ3S/L6SrkrbGS/pAUmjatgOMzNrh6oSUPpSf6+kQcBU4BJJP61RDJsBYyNia2ABcEylQhHxNnA5MDrN2gOYGhHzKhRfKSI+CHwV+F6adwzwStrOD4DtKm1H0pGSJkma9NZrb3W0TWZm1oZqe0ADImIB8GngkojYjiIB1MIzEdH0oJPLgR1bKXsx0HSs5zDgkhbK/TH9OxkYmqZ3BK4EiIhHgWmVVoyIsRExKiJGrdxv5aoaYGZm7VdtAlpJ0rrAZ4EbaxxDVHi/hGWx9XlnQcQzwAuSPgp8CLi5hToXp3+Xsuw4l2oSrZmZ1US1Ceg04FbgqYh4SNIwYGaNYhgiaYc0/TngXmAOy4bIPtOs/EUUPaWrImJpO7ZzL0UCRdIWwPs7GrCZmXVeVQkoIq6OiK0j4uj0flZENE8MHfU4cKikacAg4Hzg+8AvJN1D0Ysp+zPQj5aH31rya2Bw2s63KIbg5ncmcDMz67iqTsOWtClFYlg7IrZKZ8F9MiJO78zGI2IOsEWFRfcAm7aw2jYUJx/MKNVzaml619L0PJYdA/ov8PmI+K+k4cAdwNyOR29mZp1R7RDchcC3gbcAImIacFC9gmqJpBOBa1Ms7dUXuFfSVOA64OiIeLOW8ZmZWfWqvRC1b0Q8KC13HH9JHeJpVUScBZzVwXUXAr7ux8ysQVTbA5qXhq0CQNL+wPN1i8rMzHq8antAxwJjgc0l/QuYzbILQs3MzNqtzQQkqRcwKiL2kLQa0CsNZ5mZmXVYm0Nw6RY4X07Ti5x8zMysFqo9BvRXSSdI2lDSoKZXXSMzM7MerdpjQIelf48tzQtgWG3DMTOzFUVVCSgiNq53II1o87U2Z+JXJrZd0MzM2q3aOyFUfNpoRFxW23DMzGxFUe0Q3Pal6T7A7sDDgBOQmZl1SLVDcF8pv5c0APhdXSIyM7MVQrsfyZ28DoyoZSBmZrZiqfYY0A0se3BcL4o7WF9dr6DMzKznq/YY0I9L00uAuRHxbB3iMTOzFUS1Q3Afj4gJ6TUxIp
6VdHZdIzMzsx5NEdF2IenhiPhAs3nTImLrukXWADbr3z/GbvuBtgua1dAud0/IHYJZp0iaHBFtPv6m1SE4SUcDxwDD0qOsm/QHfIWmmZl1WFvHgH4P3Az8EDixNH9hRLxct6jMzKzHazUBRcR8YD7wOQBJa1FciNpPUr+IeLr+IZqZWU9U1UkIkvaRNJPiQXQTgDkUPSMzM7MOqfYsuNOBDwP/TDcm3R0fAzIzs06oNgG9FREvAb0k9YqIO4GRdYzLzMx6uGovRH1VUj/gHuAKSf+huCDVzMysQ6rtAe1Lcf+3rwK3AE8B+9QrKDMz6/mqvRv2IkkbASMi4lJJfYHe9Q3NzMx6smrPgjsCuAb4TZq1PnB9vYIyM7Oer9ohuGOBjwALACJiJrBWvYLqKEmjJJ3bRpmBko7pqpjMzKyyahPQ4oh4s+mNpJVY9niGhhERkyLiuDaKDaS4vZCZmWVUbQKaIOk7wKqS/g/Fs4BuqEdAkoZKmiHpIkmPSrpC0h6SJkqaKemD6XWfpEfSv5uldXeVdGOaPlXSxZLukjRLUlNiOgsYLmmKpHPq0QYzM2tbtadhnwgcDkwHvgTcBFxUr6CATYADgCOBh4CDgR2BTwLfAb4A7BwRSyTtAZwJfKZCPZsDu1HcPPUJSeentmwVERWvY5J0ZNoua6+ySi3bZGZmJW3dDXtIRDwdEW8DF6ZXV5gdEdNTDI8Bd0RESJoODAUGAJdKGkExFLhyC/X8JSIWA4vTtUtrt7XhiBgLjIXicQydbomZmVXU1hDcO2e6Sbq2zrGULS5Nv116/zZF0vwBcGdEbEVxPVKfKupZSvU9PjMzq7O2EpBK08PqGUg7DQD+labHtHPdhRRDcmZmllFbCShamM7tR8APJU2knRfEpnvaTUwnOPgkBDOzTFp9JLekpcAiip7QqhS34yG9j4h4b90jzMiP5LYc/Ehu6+5q8kjuiPDtdszMrC6qvQ7IzMysppyAzMwsCycgMzPLwgnIzMyycAIyM7MsnIDMzCwLJyAzM8vCCcjMzLLwzTlb0X+zzXxVuplZnbgHZGZmWTgBmZlZFk5AZmaWhROQmZll4QRkZmZZOAGZmVkWPg27Ff95dj7nff2G3GFYlb78k31yh2Bm7eAekJmZZeEEZGZmWTgBmZlZFk5AZmaWhROQmZll4QRkZmZZOAGZmVkWTkBmZpaFE5CZmWXhBGRmZlk4AZmZWRZ1S0CSXqtxfadKOqEG9QyVdHAtYjIzs45bEXtAQwEnIDOzzOqegCTtKunG0vvzJI1J0x+XNEPSvZLOLZdrwTaS/iZppqQjUh2SdI6kRyVNl3Rga/OBs4CdJE2R9LUK8R4paZKkSa+9Pr8Wu8DMzCrI9jgGSX2A3wA7R8RsSX+oYrWtgQ8DqwGPSPoLsAMwEtgGWBN4SNLdwP+0MP9E4ISI+ESlDUTEWGAswJB1RkQnmmhmZq3IOQS3OTArIman99UkoD9FxBsRMQ+4E/ggsCPwh4hYGhEvABOA7VuZb2ZmDaArEtCSZtvpk/5VB+pq3iOJVurpSP1mZtZFuiIBzQW2kLSKpAHA7mn+DGCYpKHp/YEV1m1uX0l9JK0B7Ao8BNwNHCipt6TBwM7Ag63MXwj0r0nLzMysw+p+DCginpF0FTANmAk8kua/IekY4BZJ8yiSQ1seBP4CDAF+EBHPSbqO4jjQVIoe0Tcj4t+tzH8JWCJpKjAuIn5W0wabmVlVFJHvOLukfhHxmiQBvwJmNlJCGLLOiPjm6J/mDsOq9OWf7JM7BDMDJE2OiFFtlct9HdARkqYAjwEDKM6KMzOzFUC207ABUm9nuR6PpC8CxzcrOjEiju2ywMzMrO6yJqBKIuIS4JLccZiZWX3lHoIzM7MVlBOQmZll4QRkZmZZOAGZmVkWTkBmZpZFw50F10jW2mCAL240M6sT94DMzCwLJyAzM8vCCcjMzLJwAjIzsyycgMzMLAsnIDMzy8KnYbfi+dlPccbn98
8dRrdz0uXX5A7BzLoB94DMzCwLJyAzM8vCCcjMzLJwAjIzsyycgMzMLAsnIDMzy8IJyMzMsnACMjOzLJyAzMwsCycgMzPLwgnIzMyy6BYJSNKpkk7IHYeZmdVOt0hAtSDJN141M2sgDZuAJJ0k6QlJtwObpXkjJd0vaZqk6ySt3sb8uySdKWkCcLykAyQ9KmmqpLvztc7MzBoyAUnaDjgI2Bb4NLB9WnQZ8K2I2BqYDnyvjfkAAyNil4j4CXAKsGdEbAN8soVtHylpkqRJi/67uNZNMzOzpCETELATcF1EvB4RC4A/A6tRJJMJqcylwM6SBlSaX6prfGl6IjBO0hFA70objoixETEqIkat1meVGjbJzMzKGjUBAUSN6ln0ToURRwEnAxsCUyStUaNtmJlZOzVqArob+JSkVSX1B/ahSCSvSNoplTkEmBAR8yvNr1SppOER8UBEnALMo0hEZmaWQUOeGRYRD0saD0wB5gL3pEWHAhdI6gvMAr7YxvzmzpE0AhBwBzC1Tk0wM7M2NGQCAoiIM4AzKiz6cIWyU1qYv2uz95+uVXxmZtY5jToEZ2ZmPZwTkJmZZeEEZGZmWTgBmZlZFk5AZmaWhROQmZll4QRkZmZZOAGZmVkWDXshaiNYd+PhnHT5NbnDMDPrkdwDMjOzLJyAzMwsCycgMzPLQhG1euxOzyNpIfBE7jjqYE2Kx1H0NG5X9+J2dS/taddGETG4rUI+CaF1T0TEqNxB1JqkSW5X9+F2dS9uV/U8BGdmZlk4AZmZWRZOQK0bmzuAOnG7uhe3q3txu6rkkxDMzCwL94DMzCwLJyAzM8tihU1AkvaS9ISkJyWdWGH5KpLGp+UPSBpaWvbtNP8JSXt2Zdyt6WibJA2V9IakKel1QVfH3poq2rWzpIclLZG0f7Nlh0qamV6Hdl3Ubetku5aWPq8/d13UbauiXf8r6R+Spkm6Q9JGpWXd+fNqrV3d+fM6StL0FPu9krYoLevcd2FErHAvoDfwFDAMeA8wFdiiWZljgAvS9EHA+DS9RSq/CrBxqqd3N2/TUODR3G3oRLuGAlsDlwH7l+YPAmalf1dP06vnblNn25WWvZa7DZ1o125A3zR9dOnvsLt/XhXb1QM+r/eWpj8J3JKmO/1duKL2gD4IPBkRsyLiTeBKYN9mZfYFLk3T1wC7S1Kaf2VELI6I2cCTqb7cOtOmRtZmuyJiTkRMA95utu6ewF8j4uWIeAX4K7BXVwRdhc60q5FV0647I+L19PZ+YIM03d0/r5ba1ciqadeC0tvVgKYz1zr9XbiiJqD1gWdK759N8yqWiYglwHxgjSrXzaEzbQLYWNIjkiZI2qnewbZDZ/Z3o35W0PnY+kiaJOl+SfvVNrROaW+7Dgdu7uC6Xakz7YJu/nlJOlbSU8CPgOPas25rVtRb8VT61d/8fPSWylSzbg6dadPzwJCIeEnSdsD1krZs9ssnl87s70b9rKDzsQ2JiOckDQP+Jml6RDxVo9g6o+p2Sfo8MArYpb3rZtCZdkE3/7wi4lfAryQdDJwMHFrtuq1ZUXtAzwIblt5vADzXUhlJKwEDgJerXDeHDrcpdaFfAoiIyRRjuZvWPeLqdGZ/N+pnBZ2MLSKeS//OAu4Ctq1lcJ1QVbsk7QGcBHwyIha3Z91MOtOubv95lVwJNPXgOv955T4IluNF0fObRXHgrOnA25bNyhzL8gfsr0rTW7L8gbdZNMZJCJ1p0+CmNlAcjPwXMCh3m6ptV6nsON59EsJsigPaq6fpntCu1YFV0vSawEyaHThu5HZRfPk+BYxoNr9bf16ttKu7f14jStP7AJPSdKe/C7PvgIw7/uPAP9MfzElp3mkUv1wA+gBXUxxYexAYVlr3pLTeE8DeudvS2TYBnwEeS39MDwP75G5LO9u1PcWvsUXAS8BjpXUPS+19Evhi7rbUol3A/wDT0+c1HTg8d1va2a7bgReAKen15x7yeVVsVw/4vH6Rvh+mAHdSSl
Cd/S70rXjMzCyLFfUYkJmZZeYEZGZmWTgBmZlZFk5AZmaWhROQmZll4QRk3VazOwxPKd+xvB11DJR0TO2je6f+MZLOq1f9LWxzv/Idi7t422tLulHS1HRn6JtyxGHdgxOQdWdvRMTI0mtOB+oYSHGX8HaR1LsD26q7dIeL/SjuVJzDaRQ3FN0mIrYA3nV7//ZKbbIeyAnIehRJvSWdI+mh9FyWL6X5/dIzWh5OzzZpuuPvWcDw1IM6R9Kukm4s1XeepDFpeo6kUyTdCxwgabikWyRNlnSPpM3biG2cpPMl3SlplqRdJF0s6XFJ40rlXpP0kxTrHZIGp/kj080sp0m6TtLqaf5dks6UNAH4FsUt889JbRou6Yi0P6ZKulZS31I850q6L8WzfymGb6b9NFXSWWleNe1dl+LiWQCiuJt3a3VW06bjJQ1OsT+UXh9pbV9bN5H7Kly//OroC1jKsqvOr0vzjgROTtOrAJMobhOyEum5JhS3Q3mS4maKQyk9CwnYFbix9P48YEyangN8s7TsDtJtSoAPAX+rEOMY4Lw0PY7iXlpNj/VYALyf4ofgZGBkKhfA6DR9Smn9acAuafo04Odp+i7g16VtjmP5W/esUZo+HfhKqdzVaftbUNyWH2Bv4D6WPdtmUDvauyfwKsUV8ycB67VRZ7Vt+j2wY5oeAjye++/Pr86/3LW17uyNiBjZbN7HgK1Lv+YHACMofpWfKWlniufrrA+s3YFtjoeiR0Vxi5WrteyRSqtUsf4NERGSpgMvRMT0VN9jFMlwSopvfCp/OfBHSQOAgRExIc2/lCJ5LBdXC7aSdDrFcGM/4NbSsusj4m3gH5Ka9scewCWRnm0TES9X296IuDXd8XkviqTziKStWqizPW3aA9iitO33SuofEQtbabc1OCcg62lE8Qv/1uVmFsNog4HtIuItSXMo7o3X3BKWH5puXmZR+rcX8GqFBNiWpjskv12abnrf0v/Hau6XtaiVZeOA/SJiatoPu1aIB5bdXl8Vtll1eyPiZYoey+/TcObOLdTZlnKbegE7RMQb7azDGpiPAVlPcytwtKSVASRtKmk1ip7Qf1Ly2Q3YKJVfCPQvrT+X4pf2KukX+u6VNhLFs5JmSzogbUfksJMaAAABK0lEQVSStqlRG3oBTT24g4F7I2I+8IqWPSzwEGBCpZV5d5v6A8+nfTK6iu3fBhxWOlY0qNr2Svpoab3+wHDg6RbqbE+bbgO+XNpOexO/NSD3gKynuYhiKOthFeM1L1KcFXYFcIOkSRTDXDMAongI30RJjwI3R8Q3JF1FcWxiJvBIK9saDZwv6WRgZYrjO1Nr0IZFwJaSJlM8tfbANP9Q4IL0JT4L+GIL618JXCjpOIpE9l3gAYrkOp3lk9O7RMQt6Qt+kqQ3gZuA71Bde7cDzpPU1JO8KCIegneSRvM6q23TcRQPRJtG8b11N3BUa+2wxue7YZs1GEmvRUS/3HGY1ZuH4MzMLAv3gMzMLAv3gMzMLAsnIDMzy8IJyMzMsnACMjOzLJyAzMwsi/8PK3DmWMDsspcAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Creating a seaborn bar plot\n",
"\n",
"sns.barplot(x=feature_scores, y=feature_scores.index)\n",
"\n",
"\n",
"\n",
"# Add labels to the graph\n",
"\n",
"plt.xlabel('Feature Importance Score')\n",
"\n",
"plt.ylabel('Features')\n",
"\n",
"\n",
"\n",
"# Add title to the graph\n",
"\n",
"plt.title(\"Visualizing Important Features\")\n",
"\n",
"\n",
"\n",
"# Visualize the graph\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 17. Build the Random Forest model on selected features\n",
"\n",
"\n",
"Now, I will drop the least important feature `doors` from the model, rebuild the model and check its effect on accuracy."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# declare feature vector and target variable\n",
"\n",
"X = df.drop(['class', 'doors'], axis=1)\n",
"\n",
"y = df['class']"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# split data into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will build the random forest model and check accuracy."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# encode categorical variables with ordinal encoding\n",
"\n",
"encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'persons', 'lug_boot', 'safety'])\n",
"\n",
"\n",
"X_train = encoder.fit_transform(X_train)\n",
"\n",
"X_test = encoder.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with doors variable removed : 0.9264\n"
]
}
],
"source": [
"# instantiate the classifier with n_estimators = 100\n",
"\n",
"clf = RandomForestClassifier(random_state=0)\n",
"\n",
"\n",
"\n",
"# fit the model to the training set\n",
"\n",
"clf.fit(X_train, y_train)\n",
"\n",
"\n",
"# Predict on the test set results\n",
"\n",
"y_pred = clf.predict(X_test)\n",
"\n",
"\n",
"\n",
"# Check accuracy score \n",
"\n",
"print('Model accuracy score with doors variable removed : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I have removed the `doors` variable from the model, rebuild it and checked its accuracy. The accuracy of the model with `doors` variable removed is 0.9264. The accuracy of the model with all the variables taken into account is 0.9247. So, we can see that the model accuracy has been improved with `doors` variable removed from the model.\n",
"\n",
"Furthermore, the second least important model is `lug_boot`. If I remove it from the model and rebuild the model, then the accuracy was found to be 0.8546. It is a significant drop in the accuracy. So, I will not drop it from the model."
]
},
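    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As an aside (a sketch, not part of the original analysis): instead of dropping columns by hand, scikit-learn's `SelectFromModel` can automate importance-based feature selection. The `threshold='median'` value below is an illustrative choice, not a recommendation:"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
      "# Automated importance-based feature selection (illustrative sketch)\n",
      "from sklearn.feature_selection import SelectFromModel\n",
      "\n",
      "# keep only the features whose importance is at least the median importance\n",
      "selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold='median')\n",
      "selector.fit(X_train, y_train)\n",
      "\n",
      "print('selected features:', X_train.columns[selector.get_support()].tolist())"
     ]
    },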
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, based on the above analysis we can conclude that our classification model accuracy is very good. Our model is doing a very good job in terms of predicting the class labels.\n",
"\n",
"\n",
"But, it does not give the underlying distribution of values. Also, it does not tell anything about the type of errors our classifer is making. \n",
"\n",
"\n",
"We have another tool called `Confusion matrix` that comes to our rescue."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 18. Confusion matrix\n",
"\n",
"\n",
"A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
"\n",
"\n",
"Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-\n",
"\n",
"\n",
"**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
"\n",
"\n",
"**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
"\n",
"\n",
"**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
"\n",
"\n",
"\n",
"**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called **Type II error.**\n",
"\n",
"\n",
"\n",
"These four outcomes are summarized in a confusion matrix given below.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[107 8 7 7]\n",
" [ 0 17 1 2]\n",
" [ 10 0 387 0]\n",
" [ 3 4 0 18]]\n"
]
}
],
"source": [
"# Print the Confusion Matrix and slice it into four pieces\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm = confusion_matrix(y_test, y_pred)\n",
"\n",
"print('Confusion matrix\\n\\n', cm)\n",
"\n"
]
},
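    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The four outcome counts defined above can be read directly off this matrix. As a short sketch (using the scikit-learn convention that rows of `cm` are true labels and columns are predicted labels), the per-class TP, FP and FN counts are:"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
      "# Derive per-class TP, FP and FN counts from the confusion matrix\n",
      "import numpy as np\n",
      "\n",
      "TP = np.diag(cm)          # correctly predicted as each class\n",
      "FP = cm.sum(axis=0) - TP  # predicted as the class, but actually another\n",
      "FN = cm.sum(axis=1) - TP  # actually the class, but predicted as another\n",
      "\n",
      "print('TP:', TP)\n",
      "print('FP:', FP)\n",
      "print('FN:', FN)"
     ]
    },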
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 19. Classification Report\n",
"\n",
"\n",
"**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1** and **support** scores for the model. I have described these terms in later.\n",
"\n",
"We can print a classification report as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" acc 0.89 0.83 0.86 129\n",
" good 0.59 0.85 0.69 20\n",
" unacc 0.98 0.97 0.98 397\n",
" vgood 0.67 0.72 0.69 25\n",
"\n",
" micro avg 0.93 0.93 0.93 571\n",
" macro avg 0.78 0.84 0.81 571\n",
"weighted avg 0.93 0.93 0.93 571\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, y_pred))"
]
},
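    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To connect the report back to the confusion matrix, the per-class precision and recall can be recomputed by hand. A minimal sketch, reusing the `cm` array from section 18 (rows are true labels, columns are predictions):"
     ]
    },
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
      "# Recompute per-class precision, recall and F1 from the confusion matrix\n",
      "import numpy as np\n",
      "\n",
      "TP = np.diag(cm)\n",
      "precision = TP / cm.sum(axis=0)  # TP / (TP + FP), column-wise\n",
      "recall = TP / cm.sum(axis=1)     # TP / (TP + FN), row-wise\n",
      "f1 = 2 * precision * recall / (precision + recall)\n",
      "\n",
      "for label, p, r, f in zip(['acc', 'good', 'unacc', 'vgood'], precision, recall, f1):\n",
      "    print('{:>6}: precision={:.2f} recall={:.2f} f1={:.2f}'.format(label, p, r, f))"
     ]
    },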
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 20. Results and conclusion\n",
"\n",
"\n",
"1.\tIn this project, I build a Random Forest Classifier to predict the safety of the car. I build two models, one with 10 decision-trees and another one with 100 decision-trees. \n",
"2.\tThe model accuracy score with 10 decision-trees is 0.9247 but the same with 100 decision-trees is 0.9457. So, as expected accuracy increases with number of decision-trees in the model.\n",
"3.\tI have used the Random Forest model to find only the important features, build the model using these features and see its effect on accuracy. The most important feature is `safety` and least important feature is `doors`.\n",
"4.\tI have removed the `doors` variable from the model, rebuild it and checked its accuracy. The accuracy of the model with `doors` variable removed is 0.9264. The accuracy of the model with all the variables taken into account is 0.9247. So, we can see that the model accuracy has been improved with `doors` variable removed from the model.\n",
"5.\tThe second least important model is `lug_boot`. If I remove it from the model and rebuild the model, then the accuracy was found to be 0.8546. It is a significant drop in the accuracy. So, I will not drop it from the model.\n",
"6.\tConfusion matrix and classification report are another tool to visualize the model performance. They yield good performance.\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}