Created
December 28, 2018 13:50
Data Preprocessing Project - Dealing with Text and Categorical data
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Data Preprocessing Project - Dealing with Text and Categorical data\n", | |
"\n", | |
"\n", | |
"In this project, I discuss various data preprocessing techniques to deal with text and categorical data. \n", | |
"\n", | |
"The contents of this project are categorized into various sections which are listed below:-\n", | |
"\n", | |
"\n", | |
"\n", | |
"## Table of Contents:-\n", | |
"\n", | |
"\n", | |
"\n", | |
"1.\tIntroduction\n", | |
"\n", | |
"2.\tTypes of data variable \n", | |
"\n", | |
"3.\tExample dataset\n", | |
"\n", | |
"4.\tEncoding class labels with LabelEncoder\n", | |
"\n", | |
"5.\tEncoding categorical integer labels with OneHotEncoder\n", | |
"\n", | |
"6.\tEncode multi-class labels to binary labels with LabelBinarizer\n", | |
"\n", | |
"7.\tEncoding list of dictionaries with DictVectorizer\n", | |
"\n", | |
"8.\tConverting text document to word count vectors with CountVectorizer\n", | |
"\n", | |
"9.\tConverting text document to word frequency vectors with TfidfVectorizer\n", | |
"\n", | |
"10.\tTransforming a counted matrix to normalized tf-idf representation with TfidfTransformer\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"=================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 1. Introduction\n", | |
"\n", | |
"\n", | |
"In the previous project, I have discussed data preprocessing techniques to handle missing numerical data. \n", | |
"But, the real world datasets also contain text and categorical data. In this project, I will discuss \n", | |
"various techniques to deal with text and categorical data effectively.\n", | |
"\n", | |
"\n", | |
"Machine Learning algorithms require that input data must be in numerical format. Only then the algorithms \n", | |
"work successfully on them. So, the text data must be converted into numbers before they are fed into an \n", | |
"algorithm. \n", | |
"\n", | |
"\n", | |
"The process of converting text data into numbers consists of two steps. First, the text data must be parsed \n", | |
"to remove words. This process is called **tokenization**. Then the words need to be encoded as integers or \n", | |
"floating point values for use as input to a machine learning algorithm. This process is called \n", | |
"**feature extraction** or **vectorization**.\n", | |
"\n", | |
"\n", | |
"The Scikit-Learn library provides useful classes like **LabelEncoder**, **OneHotEncoder**, **LabelBinarizer**, **DictVectorizer**, **CountVectorizer** etc. to perform **tokenization** and **vectorization**. \n", | |
"In this project, I will explore these classes and the process of encoding text and categorical data \n", | |
"into numerical representation.\n", | |
"\n" | |
] | |
}, | |
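The two steps described above, tokenization followed by vectorization, can be sketched in plain Python before turning to the Scikit-Learn classes. This is only a minimal illustration: the helper names `tokenize` and `vectorize` and the two sample sentences are made up for this sketch and are not part of Scikit-Learn.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Two made-up sample documents
documents = ["The court is fast", "The clay court is slow, very slow"]

# Step 1: tokenization
tokenized = [tokenize(doc) for doc in documents]

# Step 2: vectorization against a shared, sorted vocabulary
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vectors = [vectorize(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
print(vectors)
```

Each document becomes a list of word counts with one position per vocabulary word, which is the same idea that CountVectorizer implements later in this project.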
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"======================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 2. Types of data variable\n", | |
"\n", | |
"\n", | |
"We can divide categorical data into four types. These are nominal, ordinal, interval and ratio data. These terms were developed by Stanley Smith Stevens, an American psychologist. His work was published in 1946 and these terms came into effect. These four types of data variable – nominal, ordinal, interval and ratio data are best understood with the help of examples.\n", | |
"\n", | |
"\n", | |
"\n", | |
"### Nominal variable\n", | |
"\n", | |
"\n", | |
"A categorical variable is also called a nominal variable when it has two or more categories. It is mutual exclusive, but not ordered variable. There is no ordering associated with this type of variable. Nominal scales are used for labelling variables without any quantitative value. For example, gender is a categorical variable having two categories - male and female, and there is no intrinsic ordering to the categories. Hair colour is also a categorical variable having a number of categories - black, blonde, brown, brunette, red, etc. and there is no agreed way to order these from highest to lowest. If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.\n", | |
"\n", | |
"\n", | |
"\n", | |
"### Ordinal variable\n", | |
"\n", | |
"\n", | |
"In ordinal variables, there is a clear ordering of the variables. Here the order matters but not the difference between values. The order of the values is important and significant. For example, suppose we have a variable economic status with three categories - low, medium and high. In this example, we classify people into three categories. Also, we order the categories as low, medium and high. So, the categories express an order of measurement. Ordinal scales are measures of non-numeric concepts like satisfaction and happiness. It is easy to remember because it sounds like ordering into different categories. If these categories were equally spaced, then that variable would be an interval variable.\n", | |
"\n", | |
"\n", | |
"\n", | |
"### Interval variable\n", | |
"\n", | |
"\n", | |
"An interval variable is similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. Interval scales are numeric scales in which we know both the order and the exact differences between the values. The difference between two interval variables is measurable and constant. The example of an interval scale is Celsius temperature because the difference between each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees. \n", | |
"\n", | |
"\n", | |
"\n", | |
"### Ratio variable\n", | |
"\n", | |
"\n", | |
"A ratio variable, has all the properties of an interval variable, and also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable. Variables like height, weight, enzyme activity are ratio variables. Temperature, expressed in F or C, is not a ratio variable. A ratio variable, has all the properties of an interval variable, and also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable. Variables like height, weight, enzyme activity are ratio variables. Temperature, expressed in F or C, is not a ratio variable. \n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"=============================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 3. Example dataset\n", | |
"\n", | |
"\n", | |
"I create an example dataset to illustrate various techniques to deal with the text and categorical data. The example dataset is about Grand Slam Tennis tournaments. The dataset contains five columns describing the Grand Slam title, host country, surface type, court speed and prize money in US Dollars Millions associated with the tournaments. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So, I will start by importing the required Python libraries." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Import required libraries\n", | |
"\n", | |
"import numpy as np\n", | |
"\n", | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Create an example dataset\n", | |
"\n", | |
"import pandas as pd\n", | |
"\n", | |
"df = pd.DataFrame([\n", | |
" ['Australian Open', 'Australia', 'Hard Court','Medium', 3.2 ],\n", | |
" ['French Open', 'France', 'Clay Court', 'Slow', 2.7],\n", | |
" ['Wimbledon', 'UK', 'Grass Court', 'Fast', 2.91],\n", | |
" ['US Open', 'USA', 'Hard Court', 'Medium', 3.8]])\n", | |
"\n", | |
"df.columns = ['Grand Slam Title', 'Host Country', 'Surface Type', 'Court Speed', 'Prize Money(USD million)']\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Grand Slam Title</th>\n", | |
" <th>Host Country</th>\n", | |
" <th>Surface Type</th>\n", | |
" <th>Court Speed</th>\n", | |
" <th>Prize Money(USD million)</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Australian Open</td>\n", | |
" <td>Australia</td>\n", | |
" <td>Hard Court</td>\n", | |
" <td>Medium</td>\n", | |
" <td>3.20</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>French Open</td>\n", | |
" <td>France</td>\n", | |
" <td>Clay Court</td>\n", | |
" <td>Slow</td>\n", | |
" <td>2.70</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Wimbledon</td>\n", | |
" <td>UK</td>\n", | |
" <td>Grass Court</td>\n", | |
" <td>Fast</td>\n", | |
" <td>2.91</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>US Open</td>\n", | |
" <td>USA</td>\n", | |
" <td>Hard Court</td>\n", | |
" <td>Medium</td>\n", | |
" <td>3.80</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Grand Slam Title Host Country Surface Type Court Speed \\\n", | |
"0 Australian Open Australia Hard Court Medium \n", | |
"1 French Open France Clay Court Slow \n", | |
"2 Wimbledon UK Grass Court Fast \n", | |
"3 US Open USA Hard Court Medium \n", | |
"\n", | |
" Prize Money(USD million) \n", | |
"0 3.20 \n", | |
"1 2.70 \n", | |
"2 2.91 \n", | |
"3 3.80 " | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# View the first few rows of the dataset\n", | |
"\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 4 entries, 0 to 3\n", | |
"Data columns (total 5 columns):\n", | |
"Grand Slam Title 4 non-null object\n", | |
"Host Country 4 non-null object\n", | |
"Surface Type 4 non-null object\n", | |
"Court Speed 4 non-null object\n", | |
"Prize Money(USD million) 4 non-null float64\n", | |
"dtypes: float64(1), object(4)\n", | |
"memory usage: 240.0+ bytes\n", | |
"None\n" | |
] | |
} | |
], | |
"source": [ | |
"# View the summary of dataframe df\n", | |
"\n", | |
"print(df.info())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" Prize Money(USD million)\n", | |
"count 4.000000\n", | |
"mean 3.152500\n", | |
"std 0.477869\n", | |
"min 2.700000\n", | |
"25% 2.857500\n", | |
"50% 3.055000\n", | |
"75% 3.350000\n", | |
"max 3.800000\n" | |
] | |
} | |
], | |
"source": [ | |
"# View the descriptive statistics\n", | |
"\n", | |
"print(df.describe())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Interpretation**\n", | |
"\n", | |
"We can see that the dataframe df contains 5 columns. The columns **Grand Slam Title**, **Host Country**, **Surface Type** and **Court Speed** are of object data types while the column **Prize Money** is of integer data type.\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"=====================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 4. Encoding class labels with LabelEncoder\n", | |
"\n", | |
"\n", | |
"The machine learning algorithms require that class labels are encoded as integers. Most estimators for classification convert class labels to integers internally. It is considered a good practice to provide class labels as integer arrays to avoid problems. Scikit-Learn provides a transformer for this task called **LabelEncoder**. \n", | |
"\n", | |
"\n", | |
"Suppose there are three nominal variables x1, x2, x3 given by NumPy array y\n", | |
"\n", | |
"\n", | |
"`y = df[[‘x1', ‘x2’, ‘x3’]].values`\n", | |
"\n", | |
"\n", | |
"The following code has been implemented in Scikit-Learn to transform y into integer values.\n", | |
"\n", | |
"\n", | |
"`from sklearn.preprocessing import LabelEncoder`\n", | |
"\n", | |
"`le = LabelEncoder()`\n", | |
"\n", | |
"`y = le.fit_transform(df[[‘x1’, ‘x2’, ‘x3’]].values)`\n", | |
"\n", | |
"`print(y)`\n", | |
"\n", | |
"The fit_transform method is just a shortcut for calling fit and transform separately. We can use the inverse_transform method to transform the integer class labels back into their original string representation. \n", | |
"\n", | |
"`le.inverse_transform(y)`\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Make a copy of the dataframe df\n", | |
"\n", | |
"df1 = df.copy()\n", | |
"\n", | |
"# I have made a copy of the dataframe df as df1.\n", | |
"\n", | |
"# Now, I will work with df1." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Grand Slam Title</th>\n", | |
" <th>Host Country</th>\n", | |
" <th>Surface Type</th>\n", | |
" <th>Court Speed</th>\n", | |
" <th>Prize Money(USD million)</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Australian Open</td>\n", | |
" <td>Australia</td>\n", | |
" <td>Hard Court</td>\n", | |
" <td>Medium</td>\n", | |
" <td>3.20</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>French Open</td>\n", | |
" <td>France</td>\n", | |
" <td>Clay Court</td>\n", | |
" <td>Slow</td>\n", | |
" <td>2.70</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Wimbledon</td>\n", | |
" <td>UK</td>\n", | |
" <td>Grass Court</td>\n", | |
" <td>Fast</td>\n", | |
" <td>2.91</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>US Open</td>\n", | |
" <td>USA</td>\n", | |
" <td>Hard Court</td>\n", | |
" <td>Medium</td>\n", | |
" <td>3.80</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Grand Slam Title Host Country Surface Type Court Speed \\\n", | |
"0 Australian Open Australia Hard Court Medium \n", | |
"1 French Open France Clay Court Slow \n", | |
"2 Wimbledon UK Grass Court Fast \n", | |
"3 US Open USA Hard Court Medium \n", | |
"\n", | |
" Prize Money(USD million) \n", | |
"0 3.20 \n", | |
"1 2.70 \n", | |
"2 2.91 \n", | |
"3 3.80 " | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# View the first few rows of dataframe df1\n", | |
"\n", | |
"df1.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that the **Court Speed** variable is **Ordinal** variable. I will now encode this **Court Speed** variable into integer values." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Encoded Court Speed column labels are:\n", | |
" [1 2 0 1]\n" | |
] | |
} | |
], | |
"source": [ | |
"# Encode Court Speed column labels into integer values\n", | |
"\n", | |
"from sklearn.preprocessing import LabelEncoder\n", | |
"\n", | |
"le = LabelEncoder()\n", | |
"\n", | |
"y1 = le.fit_transform(df1['Court Speed'].values)\n", | |
"\n", | |
"print(\"Encoded Court Speed column labels are:\\n\", (y1))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Interpretation**\n", | |
"\n", | |
"We can see that the **Court Speed** column which contain **Medium**, **Slow**, **Fast** and **Medium** values are now encoded as 1 2 0 1." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The class labels are:\n", | |
" ['Fast' 'Medium' 'Slow']\n" | |
] | |
} | |
], | |
"source": [ | |
"# Print class labels\n", | |
"\n", | |
"print(\"The class labels are:\\n\", (le.classes_))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Suppress future warnings\n", | |
"\n", | |
"import warnings\n", | |
"\n", | |
"warnings.simplefilter(action = \"ignore\", category = DeprecationWarning)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The inverted original class labels are:\n", | |
" ['Medium' 'Slow' 'Fast' 'Medium']\n" | |
] | |
} | |
], | |
"source": [ | |
"# Invert the encoded class labels to original class labels\n", | |
"\n", | |
"print(\"The inverted original class labels are:\\n\",\n", | |
" le.inverse_transform([1, 2, 0, 1]))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Interpretation**\n", | |
"\n", | |
"We can view the original class labels and inverted class labels with the **le.classes_** and **le.inverse_transform(y)** commands.\n" | |
] | |
}, | |
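Because LabelEncoder numbers the labels alphabetically, an ordinal variable such as Court Speed may need an explicit mapping if the integers should follow its natural order. The following is a minimal sketch, assuming the order Slow < Medium < Fast; the names `speed_order` and `inverse_order` are introduced here only for illustration.

```python
import pandas as pd

# Court Speed values as they appear in the example dataset
court_speed = pd.Series(['Medium', 'Slow', 'Fast', 'Medium'], name='Court Speed')

# Hypothetical mapping that follows the assumed order Slow < Medium < Fast
speed_order = {'Slow': 0, 'Medium': 1, 'Fast': 2}

# Encode the labels with the explicit mapping
encoded = court_speed.map(speed_order)
print(encoded.tolist())

# Invert the mapping to recover the original labels
inverse_order = {code: label for label, code in speed_order.items()}
decoded = encoded.map(inverse_order)
print(decoded.tolist())
```

With this mapping, a larger integer always means a faster court, which is the property that an ordinal encoding is meant to preserve.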
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 5. Encoding categorical integer labels with OneHotEncoder\n", | |
"\n", | |
"\n", | |
"There is one problem associated with encoding class labels with **LabelEncoder**. Scikit-Learn’s estimator for classification treat class labels as categorical data with no order associated with it. So, we used the LabelEncoder to encode the string labels into integers. The problem arises when we apply the same approach to transform the nominal variable with LabelEncoder.\n", | |
"\n", | |
"\n", | |
"We have seen above that LabelEncoder transform NumPy array y given by\n", | |
"\n", | |
"\n", | |
"`y = df[[‘x1’, ‘x2’, ‘x3’]].values`\n", | |
"\n", | |
"\n", | |
"into integer array values given by\n", | |
"\n", | |
"\n", | |
"`array([0, 1, 2])`\n", | |
"\n", | |
"So, we can map the nominal variables x1, x2, x3 to integer values 0, 1, 2 as follows.\n", | |
"\n", | |
"x1 = 0\n", | |
"\n", | |
"x2 = 1\n", | |
"\n", | |
"x3 = 2\n", | |
"\n", | |
"\n", | |
"Although, there is no order involved with x1, x2, x3, but a learning algorithm will now assume that x1 < x2 < x3. This is wrong assumption and it will not produce desired results. We will see later, how we can solve this problem\n", | |
"\n", | |
"But, first I will convert the nominal feature variable into integer values." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, I will encode the nominal feature variable **Grand Slam Title** into integer values." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[0 3.2]\n", | |
" [1 2.7]\n", | |
" [3 2.91]\n", | |
" [2 3.8]]\n" | |
] | |
} | |
], | |
"source": [ | |
"# Encode Grand Slam Title column values into integer values\n", | |
"\n", | |
"X1 = df1[['Grand Slam Title', 'Prize Money(USD million)']].values\n", | |
"\n", | |
"title_le = LabelEncoder()\n", | |
"\n", | |
"X1[:, 0] = title_le.fit_transform(X1[:, 0])\n", | |
"\n", | |
"print(X1)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Interpretation**\n", | |
"\n", | |
"Here the problem arises. We can see that the **Grand Slam Title** column values **Australian Open**, **French Open**, **Wimbledon** and **US Open** are now encoded as 0, 1, 3 and 2.\n", | |
"\n", | |
"So, **Australian Open** is mapped to 0, **French Open** is mapped to 1, **Wimbledon** is mapped to 3 and **US Open** is mapped to 2. So, we can conclude that\n", | |
"\n", | |
"**Australian Open** < **French Open** < **US Open** < **Wimbledon**\n", | |
"\n", | |
"But, this is not true. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To fix this issue, a common solution is to use a technique called **one-hot-encoding**. In this technique, we create a new dummy feature for each unique value in the nominal feature column. The value of the dummy feature is equal to one when the unique value is present and zero otherwise. Similarly, for another unique value, the value of the dummy feature is equal to one when the unique value is present and zero otherwise. This is called one-hot encoding, because only one dummy feature will be equal to one (hot) , while the others will be zero (cold).\n", | |
"\n", | |
"Scikit-Learn provides a **OneHotEncoder** transformer to convert integer categorical values into one-hot vectors. The following code accomplish this task with the NumPy array y –\n", | |
" \n", | |
"`y = df[[‘x1’, ‘x2’, ‘x3’]].values`\n", | |
"\n", | |
"\n", | |
"`from sklearn.preprocessing import OneHotEncoder`\n", | |
"\n", | |
"`ohe = OneHotEncoder()`\n", | |
"\n", | |
"`y = ohe.fit_transform(y).toarray()`\n", | |
"\n", | |
"`print(y)`\n", | |
"\n", | |
"\n", | |
"By default, the output is a SciPy sparse matrix, instead of a NumPy array. This way of output is very useful when we have categorical attributes with thousands of categories. If there are lot of zeros, a sparse matrix only stores the location of the non-zero elements. So, sparse matrices are a more efficient way of storing large datasets. It is supported by many Scikit-Learn functions. \n", | |
"\n", | |
"\n", | |
"To convert the dense NumPy array, we should call the toarray ( ) method. To omit the toarray() step, we could alternatively initialize the encoder as \n", | |
"\n", | |
"\n", | |
"OneHotEncoder( … , sparse = False) \n", | |
"\n", | |
"\n", | |
"to return a regular NumPy array.\n", | |
"\n", | |
"Another way which is more convenient is to create those dummy features via one-hot encoding is to use the pandas.get_dummies() method. The get_dummies() method will only convert string columns and leave all other columns unchanged in a dataframe.\n", | |
"\n", | |
"\n", | |
"`import pandas as pd`\n", | |
"\n", | |
"\n", | |
"`pd.get_dummies([[‘x1’, ‘x2’, ‘x3’]])`" | |
] | |
}, | |
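The behaviour of pandas.get_dummies() described above, where string columns are converted and other columns are left unchanged, can be checked on a small dataframe. This is a minimal sketch; the dataframe `df_demo` is an illustrative two-column excerpt of the example dataset and is not used elsewhere in the project.

```python
import pandas as pd

# Illustrative excerpt: one string column and one numeric column
df_demo = pd.DataFrame({
    'Surface Type': ['Hard Court', 'Clay Court', 'Grass Court', 'Hard Court'],
    'Prize Money(USD million)': [3.2, 2.7, 2.91, 3.8],
})

# Only the string column is expanded into dummy features
dummies = pd.get_dummies(df_demo)

print(dummies.columns.tolist())
print(dummies)
```

The numeric column is kept as it is, and each surface type gets its own dummy column named after the original column and the category.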
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[1. , 0. , 0. , 0. , 3.2 ],\n", | |
" [0. , 1. , 0. , 0. , 2.7 ],\n", | |
" [0. , 0. , 0. , 1. , 2.91],\n", | |
" [0. , 0. , 1. , 0. , 3.8 ]])" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Encode the converted integer values of Grand Slam Title column values into one-hot vectors\n", | |
"\n", | |
"from sklearn.preprocessing import OneHotEncoder\n", | |
"\n", | |
"ohe = OneHotEncoder(categorical_features=[0])\n", | |
"\n", | |
"ohe.fit_transform(X1).toarray()\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Interpretation**\n", | |
"\n", | |
"We can now see that the column values in the **Grand Slam Title** column are now converted into one-hot-vectors. \n", | |
"\n", | |
"In the first row, **Australian Open** is present. So, the dummy variable contains 1 for the **Australian Open** and 0's for the other titles since other titles are not present.\n", | |
"\n", | |
"Similar explanation goes for the other **Grand Slam Title** column values in the second, third and fourth rows.\n", | |
"\n" | |
] | |
}, | |
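In newer Scikit-Learn versions, where OneHotEncoder no longer accepts `categorical_features`, the same result can be approximated with ColumnTransformer. The following is a sketch under that assumption and not the code that produced the output above; the names `df_demo`, `ct` and `title_ohe` are introduced here only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative excerpt of the example dataset
df_demo = pd.DataFrame({
    'Grand Slam Title': ['Australian Open', 'French Open', 'Wimbledon', 'US Open'],
    'Prize Money(USD million)': [3.2, 2.7, 2.91, 3.8],
})

# One-hot encode the title column and pass the prize money column through unchanged
ct = ColumnTransformer(
    transformers=[('title_ohe', OneHotEncoder(), ['Grand Slam Title'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense NumPy array
)

encoded = ct.fit_transform(df_demo)
print(encoded)
```

Here the string titles are encoded directly, so the intermediate LabelEncoder step is not needed. The categories are sorted alphabetically, which is why Wimbledon ends up in the fourth dummy column.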
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 6. Encode multi-class labels to binary labels with LabelBinarizer\n", | |
"\n", | |
"\n", | |
"\n", | |
"We can accomplish both the tasks (encoding multi-class labels to integer categories, then from integer categories to one-hot vectors or binary labels) in one shot using the Scikit-Learn’s LabelBinarizer class.\n", | |
"\n", | |
"\n", | |
"We can define a NumPy array y as follows:-\n", | |
"\n", | |
"\n", | |
"`y = df[[‘x1’, ‘x2’, ‘x3’]].values`\n", | |
"\n", | |
"The following code transform y into binary labels using LabelBinarizer\n", | |
"\n", | |
"\n", | |
"`from sklearn.preprocessing import LabelBinarizer`\n", | |
"\n", | |
"`lb = LabelBinarizer()`\n", | |
"\n", | |
"`y = lb.fit_transform(df[[‘x1’, ‘x2’, ‘x3’]].values)`\n", | |
"\n", | |
"`print(y)`\n", | |
"\n", | |
"This returns a dense NumPy array by default. We can get a sparse matrix by passing sparse_output = True to the LabelBinarizer constructor.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Copy the dataframe df2\n", | |
"\n", | |
"df2 = df.copy()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Grand Slam Title</th>\n", | |
" <th>Host Country</th>\n", | |
" <th>Surface Type</th>\n", | |
" <th>Court Speed</th>\n", | |
" <th>Prize Money(USD million)</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Australian Open</td>\n", | |
" <td>Australia</td>\n", | |
" <td>Hard Court</td>\n", | |
" <td>Medium</td>\n", | |
" <td>3.20</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>French Open</td>\n", | |
" <td>France</td>\n", | |
" <td>Clay Court</td>\n", | |
" <td>Slow</td>\n", | |
" <td>2.70</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Wimbledon</td>\n", | |
" <td>UK</td>\n", | |
" <td>Grass Court</td>\n", | |
" <td>Fast</td>\n", | |
" <td>2.91</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>US Open</td>\n", | |
" <td>USA</td>\n", | |
" <td>Hard Court</td>\n", | |
" <td>Medium</td>\n", | |
" <td>3.80</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Grand Slam Title Host Country Surface Type Court Speed \\\n", | |
"0 Australian Open Australia Hard Court Medium \n", | |
"1 French Open France Clay Court Slow \n", | |
"2 Wimbledon UK Grass Court Fast \n", | |
"3 US Open USA Hard Court Medium \n", | |
"\n", | |
" Prize Money(USD million) \n", | |
"0 3.20 \n", | |
"1 2.70 \n", | |
"2 2.91 \n", | |
"3 3.80 " | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# View the first few rows of df2\n", | |
"\n", | |
"df2.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Suppose the Prize Money column in dataframe df2 is given as a list.\n", | |
"\n", | |
"Then we can use LabelBinarizer to encode the Prize Money values into one-hot-vectors." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Slice the Prize Money column\n", | |
"\n", | |
"X2 = df2.iloc[:, 4]\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 3.20\n", | |
"1 2.70\n", | |
"2 2.91\n", | |
"3 3.80\n", | |
"Name: Prize Money(USD million), dtype: float64" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X2.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that the data type of X2 is float 64. We need to convert it into integer." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Change the data type of X2 to integer 64\n", | |
"\n", | |
"X2 = X2.astype('int64')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"dtype('int64')" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Check the data type of X2\n", | |
"\n", | |
"X2.dtype" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[1]\n", | |
" [0]\n", | |
" [0]\n", | |
" [1]]\n" | |
] | |
} | |
], | |
"source": [ | |
"# Encode X2 into one-hot-vectors with LabelBinarizer\n", | |
"\n", | |
"from sklearn.preprocessing import LabelBinarizer\n", | |
"\n", | |
"lb = LabelBinarizer()\n", | |
"\n", | |
"y2 = lb.fit_transform(X2)\n", | |
"\n", | |
"print(y2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that the Prize Money column is converted into one-hot-vectors. \n", | |
"\n", | |
"Now, I will check its classes." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The class labels are:\n", | |
" [2 3]\n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"The class labels are:\\n\", lb.classes_)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Interpretation**\n", | |
"\n", | |
"The class labels found by LabelBinarizer are 2 and 3. So, the value 2 is encoded as 0 and the value 3 is encoded as 1. No threshold is involved: each distinct value is treated as its own class.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"===================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 7. Encoding list of dictionaries with DictVectorizer\n", | |
"\n", | |
"\n", | |
"\n", | |
"We have previously seen that we can use the OneHotEncoder transformer to convert integer categorical values into one-hot vectors. When the data comes as a list of dictionaries, we can use Scikit-Learn's DictVectorizer transformer to do the same job for us. \n", | |
"\n", | |
"DictVectorizer will only do a binary one-hot encoding when feature values are of type string. Numeric feature values are passed through unchanged.\n", | |
"\n", | |
"\n", | |
"Suppose there is a list of dictionaries given by y as follows:-\n", | |
"\n", | |
"\n", | |
"`y = [{'foo1': x1}, {'foo2': x2}, {'foo3': x3}]`\n", | |
"\n", | |
"\n", | |
"We can use DictVectorizer to do a binary one-hot encoding as follows:-\n", | |
"\n", | |
"\n", | |
"`from sklearn.feature_extraction import DictVectorizer`\n", | |
"\n", | |
"`dv = DictVectorizer(sparse=False)`\n", | |
"\n", | |
"`X_dv = dv.fit_transform(y)`\n", | |
"\n", | |
"`print(X_dv)`\n", | |
"\n", | |
"With these categorical features thus encoded, we can proceed as normal with fitting a Scikit-Learn model.\n", | |
"\n", | |
"To see the meaning of each column, we can inspect the feature names as follows:-\n", | |
"\n", | |
"\n", | |
"`dv.get_feature_names()`\n", | |
"\n", | |
"Note that Scikit-Learn 1.0 introduced `get_feature_names_out()` as the replacement for this method, and `get_feature_names()` was removed in version 1.2.\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[3.2 1. 0. 0. 0. ]\n", | |
" [2.7 0. 1. 0. 0. ]\n", | |
" [2.91 0. 0. 0. 1. ]\n", | |
" [3.8 0. 0. 1. 0. ]]\n" | |
] | |
} | |
], | |
"source": [ | |
"tennis_df = [ \n", | |
" {'title': 'Australian Open', 'prize money': 3.20},\n", | |
" {'title': 'French Open', 'prize money': 2.70},\n", | |
" {'title': 'Wimbledon', 'prize money': 2.91},\n", | |
" {'title': 'US Open', 'prize money': 3.80}\n", | |
" ]\n", | |
"\n", | |
"# We can use DictVectorizer to do a binary one-hot encoding\n", | |
"\n", | |
"from sklearn.feature_extraction import DictVectorizer\n", | |
"\n", | |
"dv = DictVectorizer(sparse = False)\n", | |
"\n", | |
"X_dv = dv.fit_transform(tennis_df)\n", | |
"\n", | |
"print(X_dv)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The feature names of tennis_df data structure are:\n", | |
" ['prize money', 'title=Australian Open', 'title=French Open', 'title=US Open', 'title=Wimbledon']\n" | |
] | |
} | |
], | |
"source": [ | |
"# inspect the feature names\n", | |
"\n", | |
"print(\"The feature names of tennis_df data structure are:\\n\" , dv.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 8. Converting text document to word count vectors with CountVectorizer\n", | |
"\n", | |
"\n", | |
"We cannot work directly with text data when using machine learning algorithms, because these algorithms take numbers as input. So, we need to convert text documents to vectors of numbers.\n", | |
"\n", | |
"\n", | |
"A simple yet effective model for representing text documents in machine learning is the Bag-of-Words model, or BoW. It focuses on the occurrence of words in a document and ignores their order. Scikit-Learn's CountVectorizer transformer implements the \"bag-of-words\" technique. CountVectorizer takes the text data as input and counts the occurrences of each word within it. The result is a sparse matrix recording the number of times each word appears in each document.\n", | |
"\n", | |
"\n", | |
"For example, consider the following sample text data:-\n", | |
"\n", | |
"`corpus = ['dog', 'cat', 'dog chases cat']`\n", | |
" \n", | |
"\n", | |
"We can use CountVectorizer to convert data as follows:-\n", | |
"\n", | |
"\n", | |
"`from sklearn.feature_extraction.text import CountVectorizer`\n", | |
"\n", | |
"`cv = CountVectorizer()`\n", | |
"\n", | |
"`data = cv.fit_transform(corpus)`\n", | |
"\n", | |
"\n", | |
"We can inspect the feature names and view the transformed data as follows:-\n", | |
"\n", | |
"\n", | |
"`print(cv.get_feature_names())`\n", | |
"\n", | |
"\n", | |
"`print(data.toarray())`\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[0 0 1]\n", | |
" [1 0 0]\n", | |
" [1 1 1]]\n" | |
] | |
} | |
], | |
"source": [ | |
"# Define the sample corpus\n", | |
"\n", | |
"corpus = ['dog',\n", | |
" 'cat',\n", | |
" 'dog chases cat']\n", | |
"\n", | |
"\n", | |
"# Use CountVectorizer to convert data\n", | |
"\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"\n", | |
"cv = CountVectorizer()\n", | |
"\n", | |
"X_cv = cv.fit_transform(corpus)\n", | |
"\n", | |
"print(X_cv.toarray())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The feature names of the text document corpus are:\n", | |
" ['cat', 'chases', 'dog']\n" | |
] | |
} | |
], | |
"source": [ | |
"# Inspect the feature names\n", | |
"\n", | |
"print(\"The feature names of the text document corpus are:\\n\", cv.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"=======================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 9. Converting text document to word frequency vectors with TfidfVectorizer\n", | |
"\n", | |
"\n", | |
"There is one problem with converting text documents to word count vectors with CountVectorizer. Raw word counts produce features that put too much emphasis on words that appear frequently, even when those words carry little information. This can lead to poor results in some classification algorithms. \n", | |
"\n", | |
"\n", | |
"A solution to the above problem is to rescale the counts with TF-IDF, which stands for **Term Frequency – Inverse Document Frequency**. TF-IDF weights each word count by a measure of how many documents the word appears in, so that words occurring in most documents receive lower weight. Scikit-Learn provides the TfidfVectorizer transformer for this purpose. \n", | |
"\n", | |
"\n", | |
"The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer (described below).\n", | |
"\n", | |
"The syntax for computing TF-IDF features is given below:-\n", | |
"\n", | |
"\n", | |
"`from sklearn.feature_extraction.text import TfidfVectorizer`\n", | |
"\n", | |
"`corpus = ['dog', 'cat', 'dog chases cat']`\n", | |
"\n", | |
"`vec = TfidfVectorizer()`\n", | |
"\n", | |
"`X = vec.fit_transform(corpus)`\n", | |
"\n", | |
"`print(X.toarray())`\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[0. 0. 1. ]\n", | |
" [1. 0. 0. ]\n", | |
" [0.51785612 0.68091856 0.51785612]]\n" | |
] | |
} | |
], | |
"source": [ | |
"from sklearn.feature_extraction.text import TfidfVectorizer\n", | |
"\n", | |
"corpus = ['dog',\n", | |
"\n", | |
" 'cat',\n", | |
"\n", | |
" 'dog chases cat']\n", | |
"\n", | |
"vec1 = TfidfVectorizer()\n", | |
"\n", | |
"X_tfv = vec1.fit_transform(corpus)\n", | |
"\n", | |
"print(X_tfv.toarray())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can inspect the feature names with the following command:-\n", | |
"\n", | |
"`print(vec1.get_feature_names())`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['cat', 'chases', 'dog']\n" | |
] | |
} | |
], | |
"source": [ | |
"# get feature names\n", | |
"\n", | |
"print(vec1.get_feature_names())" | |
] | |
}, | |
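{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To see where the values in the third row of the TF-IDF matrix printed above come from, the calculation can be worked through by hand. This is a sketch that assumes TfidfVectorizer's default settings (`smooth_idf=True`, `norm='l2'`); the symbols $n$, $\\mathrm{df}(t)$ and $\\mathrm{idf}(t)$ are introduced here for illustration only.\n", | |
"\n", | |
"With $n = 3$ documents, and $\\mathrm{df}(t)$ the number of documents that contain the term $t$, the smoothed inverse document frequency is\n", | |
"\n", | |
"$$\\mathrm{idf}(t) = \\ln\\frac{1 + n}{1 + \\mathrm{df}(t)} + 1$$\n", | |
"\n", | |
"The words cat and dog each appear in 2 documents and chases appears in 1, so\n", | |
"\n", | |
"$$\\mathrm{idf}(\\text{cat}) = \\mathrm{idf}(\\text{dog}) = \\ln\\frac{4}{3} + 1 \\approx 1.2877, \\qquad \\mathrm{idf}(\\text{chases}) = \\ln\\frac{4}{2} + 1 \\approx 1.6931$$\n", | |
"\n", | |
"In the document 'dog chases cat' every word occurs once, so the unnormalized tf-idf vector is $(1.2877, 1.6931, 1.2877)$ in the column order cat, chases, dog. Dividing by its Euclidean length gives\n", | |
"\n", | |
"$$\\sqrt{2 \\times 1.2877^2 + 1.6931^2} \\approx 2.4866, \\qquad \\frac{1.2877}{2.4866} \\approx 0.5179, \\qquad \\frac{1.6931}{2.4866} \\approx 0.6809$$\n", | |
"\n", | |
"which agrees with the printed row. The first two documents contain a single word each, so their only non-zero entry is normalized to 1." | |
] | |
}, | |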
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"====================================================================================================================" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 10. Transforming a counted matrix to normalized tf-idf representation with TfidfTransformer\n", | |
"\n", | |
"\n", | |
"We have previously seen that CountVectorizer takes the text data as input and counts the occurrences of each \n", | |
"word within it. The result is a sparse matrix recording the number of times each word appears.\n", | |
"\n", | |
"If we already have such a matrix, we can use it with a TfidfTransformer to calculate the inverse \n", | |
"document frequencies (idf) and start encoding documents.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[0. 0. 1. ]\n", | |
" [1. 0. 0. ]\n", | |
" [0.51785612 0.68091856 0.51785612]]\n" | |
] | |
} | |
], | |
"source": [ | |
"# Calculate inverse document frequencies (idf)\n", | |
"\n", | |
"from sklearn.feature_extraction.text import TfidfTransformer\n", | |
"\n", | |
"vec2 = TfidfTransformer()\n", | |
"\n", | |
"X_tft = vec2.fit_transform(X_cv)\n", | |
"\n", | |
"print(X_tft.toarray())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This concludes our discussion on text and categorical data." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.0" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |