@firmai
Last active February 28, 2022 20:07
a-categorical-encoding-guide.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "KH5pfnatm_KF"
},
"source": [
"# A Guide for Applying Categorical Encoding Methods\n",
"\n",
"In this notebook, we will be investigating the most common approaches to categorical encoding and how/when to apply them."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F-pIvyEem_KG"
},
"source": [
"## Introduction\n",
"\n",
"In applied machine learning, the two most common types of structured data are numeric data (such as `age`: 10, 17, 25) and categorical data (such as `color`: red, blue, green). \n",
"\n",
"It is often easier to deal with numeric data compared to categorical data, because machine learning models typically handle mathematical vectors--numeric data can therefore be much more directly applied.\n",
"\n",
"However, machine learning algorithms cannot work directly with categorical data as they do not have intrinsic mathematical relations. As a result, we must do some amount of work on the data before being able to use it in machine learning--the methods of turning categorical data into usable, mathematical data is called categorical encoding.\n",
"\n",
"In this notebook, we will look at some of the most common approaches to categorical encoding."
]
},
{
"cell_type": "markdown",
"source": [
"Categorical Encoding\n",
"Knowing Label encoding and One-Hot encoding is enough for 99% of tasks. We will make use of the featuretools library, it is a bit old, so I would recommend you use sklearn's libarary instead. Here you can see [generally](https://github.com/scikit-learn-contrib/category_encoders/blob/master/examples/benchmarking_large/output/auc_model.pdf) the results for each type of encoding. "
],
"metadata": {
"id": "O-06SnU0xcwp"
}
},
{
"cell_type": "code",
"source": [
"%%capture \n",
"!pip install categorical-encoding\n",
"!pip install featuretools==0.14.0"
],
"metadata": {
"id": "UdnJ3gRUqO-d"
},
"execution_count": 1,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import featuretools as ft\n",
"import categorical_encoding as ce\n",
"from featuretools.tests.testing_utils import make_ecommerce_entityset\n",
"\n",
"\n",
"def create_feature_matrix():\n",
" es = make_ecommerce_entityset()\n",
" f1 = ft.Feature(es[\"log\"][\"product_id\"])\n",
" f2 = ft.Feature(es[\"log\"][\"value\"])\n",
" features = [f1, f2]\n",
" ids = [0, 1, 2, 3, 4, 5]\n",
" feature_matrix = ft.calculate_feature_matrix(features, es, instance_ids=ids)\n",
"\n",
" return feature_matrix, features, f1, f2, es, ids"
],
"metadata": {
"id": "evyVIjtOru9m"
},
"execution_count": 15,
"outputs": []
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "G_-RPVBam_KH"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import categorical_encoding as ce"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "AwZeBryOm_KI"
},
"outputs": [],
"source": [
"pd.options.display.float_format = '{:.2f}'.format #increase readability\n",
"feature_matrix, features, f1, f2, es, ids = create_feature_matrix() #load in data for demos"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qNLO_A6Rm_KI"
},
"source": [
"## Identifying as Nominal vs. Categorical Data\n",
"\n",
"#### Ordinal Data\n",
"\n",
"Ordinal data are when the values within the category take on a meaningful ordering.\n",
"\n",
"Examples of this include t-shirt sizes (`XS`, `S`, `M`, `L`, `XL`), survey opinions (`strongly dislike`, `dislike`, `like`, `strongly like`), or socieconomic status/income categories (`0-$50000`, `$50000-$100000`, `$100000+`).\n",
"\n",
"#### Nominal Data\n",
"Nominal data have no meaningful ordering.\n",
"\n",
"Examples of this include US States (`California`, `Massachusetts`, `New York`...), music genres (`Classical`, `Hip-hop`, `Jazz`...), or cuisine types (`Chinese`, `Italian`, `Tex-Mex`...)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2bp2GDGqm_KJ"
},
"source": [
"## Classic Encoders\n",
"\n",
"These encompass a broad range of encoders that are the most straightfoward and easiest to understand, making them very useful and popular among ML practioners.\n",
"\n",
"### Ordinal/Label Encoding\n",
"In ordinal encoding, each string value is assigned a whole number specific to that value--the first unique value becomes 1, the second becomes 2, and so on."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HKIQ-WcYm_KJ"
},
"source": [
"As a quick example, our data will initially look like this."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "8dYRjMjom_KK",
"outputId": "b42b27e5-f591-47bb-e5cd-c898cc4cd781",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 269
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-8b101a5e-b3bf-4bc3-85f3-23f12e5a8fe9\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>product_id</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>coke zero</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>coke zero</td>\n",
" <td>5.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>coke zero</td>\n",
" <td>10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>car</td>\n",
" <td>15.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>car</td>\n",
" <td>20.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>toothpaste</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-8b101a5e-b3bf-4bc3-85f3-23f12e5a8fe9')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-8b101a5e-b3bf-4bc3-85f3-23f12e5a8fe9 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-8b101a5e-b3bf-4bc3-85f3-23f12e5a8fe9');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" product_id value\n",
"id \n",
"0 coke zero 0.00\n",
"1 coke zero 5.00\n",
"2 coke zero 10.00\n",
"3 car 15.00\n",
"4 car 20.00\n",
"5 toothpaste 0.00"
]
},
"metadata": {},
"execution_count": 18
}
],
"source": [
"feature_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1AKMTVzmm_KL"
},
"source": [
"After fitting the Ordinal Encoder, it looks like this:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "NKN2Bkilm_KM",
"outputId": "693c3231-f466-444f-a3f3-d37239354def",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 269
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-9fda2840-62fb-4fb4-ba1a-5bdd02c8b7c8\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PRODUCT_ID_ordinal</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>5.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>15.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>20.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-9fda2840-62fb-4fb4-ba1a-5bdd02c8b7c8')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-9fda2840-62fb-4fb4-ba1a-5bdd02c8b7c8 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-9fda2840-62fb-4fb4-ba1a-5bdd02c8b7c8');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" PRODUCT_ID_ordinal value\n",
"id \n",
"0 1 0.00\n",
"1 1 5.00\n",
"2 1 10.00\n",
"3 2 15.00\n",
"4 2 20.00\n",
"5 3 0.00"
]
},
"metadata": {},
"execution_count": 19
}
],
"source": [
"ce_ord = ce.Encoder(method='ordinal')\n",
"ce_ord.fit_transform(feature_matrix, features)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3bJNlLkcm_KM"
},
"source": [
"Ordinal Encoding can be useful in niche cases, namely for **interval data**. For example, if we had t-shirt sizes `[S,M,L]`, we could map them to `[1,2,3]` because t-shirt sizes follow a logical, evenly incrementing order.\n",
"\n",
"However, keeping the data like this is usually not recommended, especially if the data values do not follow a regularly increasing order. Machine Learning algorithms cannot differentiate between categorical and numeric data and thus will infer an ordering that may be incorrect.\n",
"\n",
"Thus, ordinal encoding is often less useful on its own. Instead, many encoders, such as [sklearn's OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) require data to be in a numeric format before the encoder can be applied. Then, most will use Ordinal Encoding as a first step before applying other encoders.\n",
"\n",
"To alleviate this concern, Featuretools' categorical-encoding library's default encoders support direct encoding without having to first apply ordinal encoding."
]
},
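{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the point about ordering, here is a minimal sketch of ordinal encoding with an explicitly declared order, written in plain pandas rather than with the library used in this notebook. The `sizes` series and the `size_order` list are made-up examples and are not part of the demo data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: t-shirt sizes with a known, meaningful order.\n",
"sizes = pd.Series(['M', 'S', 'L', 'S', 'M'])\n",
"size_order = ['S', 'M', 'L']  # smallest to largest (assumed order)\n",
"\n",
"# An ordered categorical dtype records the order explicitly.\n",
"size_dtype = pd.CategoricalDtype(categories=size_order, ordered=True)\n",
"\n",
"# .cat.codes gives each value's 0-based position in size_order; add 1 to start at 1.\n",
"encoded = sizes.astype(size_dtype).cat.codes + 1\n",
"print(encoded.tolist())  # [2, 1, 3, 1, 2]\n",
"```\n",
"\n",
"Declaring the order by hand means the codes reflect `S < M < L` rather than the order in which values happen to appear in the data."
]
},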
{
"cell_type": "markdown",
"metadata": {
"id": "7I2pEbypm_KM"
},
"source": [
"### OneHot/Dummy Encoding\n",
"\n",
"One-hot encoding is the go-to approach for categorical encoding due to its ease to use/understand, versatility, and accuracy. \n",
"\n",
"One-hot encoding works by creating a new column for each value. For each new column, a 1 is assigned if the row contains that column's value and a 0 otherwise."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "xXBW1PWbm_KM",
"outputId": "61adf5ea-6cea-46aa-dde6-daced22b595b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 323
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-8b9510b7-e78d-48af-9129-f49b9a962f30\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>product_id = coke zero</th>\n",
" <th>product_id = car</th>\n",
" <th>product_id = toothpaste</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>5.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>15.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>20.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-8b9510b7-e78d-48af-9129-f49b9a962f30')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-8b9510b7-e78d-48af-9129-f49b9a962f30 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-8b9510b7-e78d-48af-9129-f49b9a962f30');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" product_id = coke zero product_id = car product_id = toothpaste value\n",
"id \n",
"0 1 0 0 0.00\n",
"1 1 0 0 5.00\n",
"2 1 0 0 10.00\n",
"3 0 1 0 15.00\n",
"4 0 1 0 20.00\n",
"5 0 0 1 0.00"
]
},
"metadata": {},
"execution_count": 20
}
],
"source": [
"ce_one_hot = ce.Encoder(method=\"one_hot\")\n",
"ce_one_hot.fit_transform(feature_matrix, features)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kAEWeEtam_KN"
},
"source": [
"One-hot encoding typically performs very well, and Featuretools' built-in `encode_features` features utilizes this. However, it has one major drawback.\n",
"\n",
"The number of new features generated is equal to the number of unique values, which leads to severe memory issues with high cardinality datasets.\n",
"\n",
"To illustrate, imagine if our data included 1000 unique products rather than 3. Then, we could go from our initial singular column to 1000 columns, one for each unique value. \n",
"\n",
"With so many added columns, memory issues can become a serious concern if coupled with many rows. "
]
},
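{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column growth described above can be seen in a small sketch that uses pandas' `get_dummies`, which is a different tool from the Featuretools encoder used in this notebook. The DataFrame is rebuilt by hand from the six demo rows, so treat it as an illustration rather than the notebook's own pipeline.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# The six demo rows, rebuilt by hand for a self-contained example.\n",
"df = pd.DataFrame({\n",
"    'product_id': ['coke zero'] * 3 + ['car'] * 2 + ['toothpaste'],\n",
"    'value': [0.0, 5.0, 10.0, 15.0, 20.0, 0.0],\n",
"})\n",
"\n",
"# One new 0/1 column per unique product_id; 'value' is left untouched.\n",
"one_hot = pd.get_dummies(df, columns=['product_id'], dtype=int)\n",
"\n",
"# The single product_id column has become 3 columns, one per unique value.\n",
"print(one_hot.shape)  # (6, 4)\n",
"```\n",
"\n",
"With 1000 unique products, the same call would produce 1000 indicator columns, which is where the memory concern comes from."
]
},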
{
"cell_type": "markdown",
"metadata": {
"id": "-Z5n1a8mm_KN"
},
"source": [
"### Hashing Encoding\n",
"\n",
"Hashing Encoding also serves as a lower-dimensionality alternative to One-Hot encoding. Hashing Encoders employ the [hashing trick](https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f), which you can also read more about [here](https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087).\n",
"\n",
"Hashing Encoders use a hashing algorithm to map category values to numeric values, which are then split into correspoding columns accordingly."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "V8sLJ7V2m_KO",
"outputId": "e19f8113-7f03-4e93-9033-34b9f4164f74",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" PRODUCT_ID_hashing[0] PRODUCT_ID_hashing[1] ... PRODUCT_ID_hashing[7] value\n",
"id ... \n",
"0 0 0 ... 0 0.00\n",
"1 0 0 ... 0 5.00\n",
"2 0 0 ... 0 10.00\n",
"3 0 1 ... 0 15.00\n",
"4 0 1 ... 0 20.00\n",
"5 0 0 ... 0 0.00\n",
"\n",
"[6 rows x 9 columns]\n"
]
}
],
"source": [
"ce_hash = ce.Encoder(method='hashing')\n",
"print(ce_hash.fit_transform(feature_matrix, features))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "G6GhgyrSm_KO"
},
"source": [
"The number of produced columns is a controllable parameter and can be set to be less than the number of unique values, meaning less total columns than one-hot encoding. The specific hashing algorithm is also controllable (default is `_md5_`).\n",
"\n",
"Hashing Encoding presents its own unique challenge in the forming of collisions, but this does not usually result in problems unless there is significant overlap.\n",
"\n",
"Overall, Hashing Encoding is another viable alternative in the case that one-hot encoding leads to dimensionality issues."
]
},
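{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the hashing trick more concrete, here is a minimal standard-library sketch. The helper names `hash_bucket` and `hash_encode` are hypothetical, and this is not necessarily how the library computes its columns internally, so the column positions may differ from the output above.\n",
"\n",
"```python\n",
"import hashlib\n",
"\n",
"\n",
"def hash_bucket(value, n_components=8):\n",
"    # Hash the category string with md5 and reduce it to a column index.\n",
"    digest = hashlib.md5(value.encode('utf-8')).hexdigest()\n",
"    return int(digest, 16) % n_components\n",
"\n",
"\n",
"def hash_encode(values, n_components=8):\n",
"    # One row per value, with a single 1 in the column chosen by the hash.\n",
"    rows = []\n",
"    for value in values:\n",
"        row = [0] * n_components\n",
"        row[hash_bucket(value, n_components)] = 1\n",
"        rows.append(row)\n",
"    return rows\n",
"\n",
"\n",
"products = ['coke zero', 'car', 'toothpaste', 'car']\n",
"encoded = hash_encode(products)\n",
"for product, row in zip(products, encoded):\n",
"    print(product, row)\n",
"```\n",
"\n",
"Equal category values always land in the same column. A collision happens when two different values share a column, which becomes more likely as `n_components` shrinks relative to the number of unique values."
]
},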
{
"cell_type": "markdown",
"metadata": {
"id": "Mu9vN_4Km_KO"
},
"source": [
"## Bayesian Encoders\n",
"\n",
"Bayesian Encoders are different from Classic Encoders in that they use information from a dependent variable as well. They output only one column and thus eliminates any concern regarding high dimensionality."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hJQ49wB2m_KO"
},
"source": [
"### Target Encoding\n",
"\n",
"Target Encoding replaces each specific category value with a weighted average of the dependent variable."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "9JLxp28Am_KO",
"outputId": "16beb2f3-4bcb-41bd-9799-d04b0cc9e71d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 323
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-8f606eca-8854-486d-977e-8e16d337374a\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PRODUCT_ID_target</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.40</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5.40</td>\n",
" <td>5.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5.40</td>\n",
" <td>10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>15.03</td>\n",
" <td>15.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>15.03</td>\n",
" <td>20.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>8.33</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-8f606eca-8854-486d-977e-8e16d337374a')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-8f606eca-8854-486d-977e-8e16d337374a button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-8f606eca-8854-486d-977e-8e16d337374a');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" PRODUCT_ID_target value\n",
"id \n",
"0 5.40 0.00\n",
"1 5.40 5.00\n",
"2 5.40 10.00\n",
"3 15.03 15.00\n",
"4 15.03 20.00\n",
"5 8.33 0.00"
]
},
"metadata": {},
"execution_count": 22
}
],
"source": [
"ce_targ = ce.Encoder(method='target')\n",
"ce_targ.fit_transform(feature_matrix, features, feature_matrix['value'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WnPbnBKXm_KP"
},
"source": [
"The primary concern with Target Encoding is overfitting/response leakage.\n",
"\n",
"For example, if we were faced with the task of predicting `value` from `PRODUCT_ID_target`, information about `value` would have already been leaked via our number for `PRODUCT_ID_target`. \n"
]
},
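{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weighted-average idea behind Target Encoding can be sketched in plain pandas. The formula and the smoothing strength `m` below are assumptions chosen for simplicity (often called additive or m-estimate smoothing). The library applies its own weighting, so these numbers will not match the output above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# The six demo rows, rebuilt by hand for a self-contained example.\n",
"df = pd.DataFrame({\n",
"    'product_id': ['coke zero'] * 3 + ['car'] * 2 + ['toothpaste'],\n",
"    'value': [0.0, 5.0, 10.0, 15.0, 20.0, 0.0],\n",
"})\n",
"\n",
"m = 1.0  # smoothing strength (hypothetical choice)\n",
"prior = df['value'].mean()  # overall mean of the target\n",
"\n",
"# Per-category sum and row count of the target.\n",
"stats = df.groupby('product_id')['value'].agg(['sum', 'count'])\n",
"\n",
"# Blend each category mean with the overall mean; rare categories lean on the prior.\n",
"smoothed = (stats['sum'] + m * prior) / (stats['count'] + m)\n",
"\n",
"df['product_id_target'] = df['product_id'].map(smoothed)\n",
"print(smoothed.round(2))  # car 14.44, coke zero 5.83, toothpaste 4.17\n",
"```\n",
"\n",
"Because the encoding is computed from `value` itself, it should be fitted on training data only and then applied to held-out data, which limits the leakage described above."
]
},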
{
"cell_type": "markdown",
"metadata": {
"id": "y-3C59QHm_KP"
},
"source": [
"### LeaveOneOut Encoding\n",
"\n",
"LeaveOneOut Encoding is identical to TargetEncoding except it handles Target Encoding's problems with overfitting/response leakage.\n",
"\n",
"In LeaveOneOut Encoding, the row in question leaves its own value out when calculating the mean."
]
},
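{
"cell_type": "markdown",
"metadata": {},
"source": [
"The leave-one-out calculation can be sketched with a pandas group-by. This is an illustrative reimplementation on a hand-built copy of the six demo rows, not the library's code. Falling back to the overall mean for a category with a single row is an assumption of this sketch.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# The six demo rows, rebuilt by hand for a self-contained example.\n",
"df = pd.DataFrame({\n",
"    'product_id': ['coke zero'] * 3 + ['car'] * 2 + ['toothpaste'],\n",
"    'value': [0.0, 5.0, 10.0, 15.0, 20.0, 0.0],\n",
"})\n",
"\n",
"grouped = df.groupby('product_id')['value']\n",
"group_sum = grouped.transform('sum')      # category total, repeated on each row\n",
"group_count = grouped.transform('count')  # category size, repeated on each row\n",
"\n",
"# Mean of the other rows in the same category (the row's own value is removed).\n",
"loo = (group_sum - df['value']) / (group_count - 1)\n",
"\n",
"# A category seen once has no other rows (0 / 0 gives NaN): use the overall mean.\n",
"df['product_id_loo'] = loo.fillna(df['value'].mean())\n",
"print(df['product_id_loo'].round(2).tolist())  # [7.5, 5.0, 2.5, 20.0, 15.0, 8.33]\n",
"```\n",
"\n",
"Each row's encoding is built only from the other rows in its category, so a row's own target never feeds into its own feature."
]
},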
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "Y4-Mdzgdm_KP",
"outputId": "0c52e1c6-a087-46fa-f327-21b8fd586748",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 269
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" <div id=\"df-12a2fe77-e653-4520-adcb-586fa17beb70\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PRODUCT_ID_leave_one_out</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7.50</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5.00</td>\n",
" <td>5.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.50</td>\n",
" <td>10.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>20.00</td>\n",
" <td>15.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>15.00</td>\n",
" <td>20.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>8.33</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-12a2fe77-e653-4520-adcb-586fa17beb70')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
" \n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
" </svg>\n",
" </button>\n",
" \n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" flex-wrap:wrap;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-12a2fe77-e653-4520-adcb-586fa17beb70 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-12a2fe77-e653-4520-adcb-586fa17beb70');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
" PRODUCT_ID_leave_one_out value\n",
"id \n",
"0 7.50 0.00\n",
"1 5.00 5.00\n",
"2 2.50 10.00\n",
"3 20.00 15.00\n",
"4 15.00 20.00\n",
"5 8.33 0.00"
]
},
"metadata": {},
"execution_count": 23
}
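],
"source": [
"# Encode each categorical feature as the mean of `value` over the other rows in the same category\n",
"ce_leave = ce.Encoder(method='leave_one_out')\n",
"ce_leave.fit_transform(feature_matrix, features, feature_matrix['value'])"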
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q_nTVz2Im_KP"
},
"source": [
"Notice how rows in the same category receive different encodings, because each row leaves its own value out of the mean. This reduces label leakage, and, with a more substantial number of rows, the calculated mean should not vary greatly from row to row within a category.\n",
"\n",
"LeaveOneOut Encoding has few drawbacks in practice, but keep in mind that train/test data must be split before fitting the encoder. Otherwise, information from the test data will leak into the training data."
]
},
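{
"cell_type": "markdown",
"metadata": {},
"source": [
"The arithmetic behind the output above can be sketched as follows. The grouping of rows into categories is inferred from that output rather than read from the data: the encodings are consistent with rows 0 to 2 sharing one `PRODUCT_ID`, rows 3 and 4 sharing another, and row 5 being the only row in its category. The symbols $y_i$, $c(i)$, and $n_{c(i)}$ are notation introduced here for illustration.\n",
"\n",
"For a row $i$ with target $y_i$ in a category $c(i)$ containing $n_{c(i)}$ rows: $$\\text{LOO}_i = \\frac{\\sum_{j \\in c(i)} y_j - y_i}{n_{c(i)} - 1}.$$\n",
"\n",
"For row 0: $$\\frac{(0 + 5 + 10) - 0}{3 - 1} = 7.5,$$ and for row 3: $$\\frac{(15 + 20) - 15}{2 - 1} = 20.$$\n",
"\n",
"Row 5 has no other rows in its category, so the formula is undefined there. Its encoding of 8.33 matches the mean of `value` over all six rows, $50 / 6 \\approx 8.33$, which suggests the encoder falls back to the global mean in that case."
]
},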
{
"cell_type": "markdown",
"metadata": {
"id": "Zwuh0SGcm_KP"
},
"source": [
"## Alternative Encoders\n",
"\n",
"The aforementioned encoders are the most commonly employed by machine learning practitioners, but other encoders exist for niche situations. We will run through several of them quickly.\n",
"\n",
"### Additional Bayesian Encoders\n",
"\n",
"#### Weight of Evidence\n",
"\n",
"Weight of Evidence (WoE) measures the predictive power of an independent variable in relation to the dependent variable through the formula: $$\\text{WoE} = \\ln{\\frac{\\text{Distribution of non-events}}{\\text{Distribution of events}}}.$$\n",
"\n",
"WoE is especially useful in certain cases because categories with similar WoE values relate to the dependent variable in a similar way, which could help with the accuracy of a machine learning algorithm.\n",
"\n",
"Read more about WoE [here](https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html)."
]
},
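{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the WoE formula, using made-up shares that are not taken from this notebook's data: if a category contains 20% of all non-events and 10% of all events, then $$\\text{WoE} = \\ln{\\frac{0.20}{0.10}} = \\ln 2 \\approx 0.693.$$\n",
"\n",
"A positive value means the category holds a larger share of non-events than of events. Swapping the two shares gives $\\ln 0.5 \\approx -0.693$, and equal shares give $\\ln 1 = 0$."
]
},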
{
"cell_type": "markdown",
"metadata": {
"id": "vYdmKdbkm_KP"
},
"source": [
"#### James-Stein\n",
"\n",
"The James-Stein estimator returns a weighted average of the global mean and of the local mean (specific to the particular category value).\n",
"\n",
"This estimator was defined only for normally distributed targets. Read more about it [here](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)."
]
},
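{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weighted average described for the James-Stein estimator can be sketched as $$\\text{JS}_c = (1 - B)\\,\\bar{y}_c + B\\,\\bar{y},$$ where $\\bar{y}_c$ is the mean of the dependent variable within category $c$, $\\bar{y}$ is the global mean, and $B$ is a shrinkage weight between 0 and 1. These symbols are notation assumed here for illustration; how $B$ is estimated depends on the encoder and is not covered in this sketch.\n",
"\n",
"With hypothetical numbers $\\bar{y}_c = 20$, $\\bar{y} = 8$, and $B = 0.25$, the encoding would be $0.75 \\times 20 + 0.25 \\times 8 = 17$. A larger $B$ pulls the category's encoding closer to the global mean."
]
},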
{
"cell_type": "markdown",
"metadata": {
"id": "dSy7yHjXm_KP"
},
"source": [
"#### M-estimator\n",
"\n",
"The M-Estimator is a simplified form of Target Encoding with a single smoothing parameter, and it performs similarly. Read more about it [here](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wp80jr5bm_KP"
},
"source": [
"### Contrast Encoders\n",
"\n",
"Contrast encoders use mathematical operations to capture differences/patterns between categories and order them accordingly.\n",
"\n",
"Some do not advise using contrast encoders as a go-to, as they produce a large number of output columns and generally do not outperform other encoders. However, in certain cases where categories follow a defined mathematical pattern, contrast encoders could offer better performance.\n",
"\n",
"Read this [guide](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) to better understand the calculations behind the encoders.\n",
"\n",
"#### Helmert Encoding\n",
"\n",
"Helmert Encoding compares the mean of the dependent variable for a specific level to the mean of the dependent variable over all of the previous levels.\n",
"\n",
"#### Sum (Deviation) Encoding\n",
"\n",
"Sum Encoding works like Helmert Encoding, except that it compares the mean of the dependent variable for a level to the overall mean over all of the levels instead of just the previous levels.\n",
"\n",
"#### Backward Difference\n",
"\n",
"Backward Difference Encoding is similar to the previous two, except that the mean of the dependent variable for a level is compared with the mean for only one other level (the prior level).\n",
"\n",
"#### Polynomial Difference\n",
"\n",
"Polynomial encoding looks for linear, quadratic, cubic, or higher-degree trends across the levels. Interval Data, as mentioned earlier for Ordinal Encoding, is a specific subset of this (the values linearly increase)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c_Vd5vt5m_KQ"
},
"source": [
"## Summary\n",
"\n",
"The go-to categorical encoding method should be one-hot encoding, or label encoding for tree-based models, in nearly every scenario. Both are straightforward to apply and typically perform well.\n",
"\n",
"However, in the cases where one-hot encoding leads to memory issues, it is sometimes necessary to look to other encoders. Ordinal, Binary, Hashing, and Target encoders are all possible alternatives, although each presents its own unique set of benefits and drawbacks.\n",
"\n",
"Another go-to method should be LeaveOneOut Encoding. It solves the memory issues that One-Hot Encoding raises, does not have the same concerns over response leakage as Target Encoding, and performs well with very little drawback in almost every situation.\n",
"\n",
"Finally, Contrast Encoders provide an interesting way to mathematically separate categories and determine patterns. However, there are many concerns with contrast encoders, chiefly the high dimensionality of their output and their inconsistent performance across problems.\n",
"\n",
"All in all, categorical encoding is an essential step in feature engineering/machine learning, and picking the correct method can be challenging. However, you should always feel free to test multiple categorical encoding methods and pick the one that yields the best performance. This guide serves as a starting point for picking the right one."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"colab": {
"name": "a-categorical-encoding-guide.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}