@firmai
Last active February 26, 2024 21:19
Financial Complaints Classification.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/firmai/c984629eb84c87127bd2d0cb7946969d/financial-complaints-classification.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aiVrbpKHdhYZ"
},
"source": [
"Text Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HUDYuepddhYf"
},
"source": [
"It is estimated that 80% of all data is unstructured. Unstructured data is the messy stuff every quantitative analyst tries to traditionally stay away from. It can include images of accidents, text notes of loss adjusters, social media comments, claim documents and reviews of medical doctors etc. How can actuaries make use of these kinds of data to add value to the insurer and what techniques are available for handling these types of data? <br><br>\n",
"In the insurance industry, text data appears everywhere but is generally more prevalent in the marketing, sales and claims. Listed below are some of the possible areas in which an insurer can benefit from text data analytics: <br><br>\n",
"* **General Insurance**\n",
" * Sentiment analysis from customer feedback\n",
" * Chatbots for product recommendations and customer service\n",
" * Automation of claims management process\n",
"* **Life Insurance**\n",
" * Increase accuracy of underwriting process with the use of context analysis from social media platforms\n",
" * Improved customer service through timely responses on coverage, billing ect. especially with the massive library of PDSes\n",
"* **Investments**\n",
" * Recommendation systems based on risk appetite identification from client conversations\n",
"\n",
"<br>\n",
"In this article, we are going to be looking at one of the topics within Natural Language Processing: Text Classification. The way we are going to handle this problem can be split into 3 distinct parts: <br>\n",
"1. Importing and cleaning the dataset <br>\n",
"2. Transforming text to numerical features <br>\n",
"3. Classifying the complaints using supervised learning techniques"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VGRcKj9ydhYg"
},
"source": [
"### Tools and Packages"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZSii7CmmdhYg"
},
"source": [
"The article uses [Python3](https://www.python.org/) and the main packages that we will be using are listed below:\n",
"* [Pandas](https://pandas.pydata.org/) and [Numpy](http://www.numpy.org/) for general data manipulation\n",
"* [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) for general data visualisation\n",
"* [Sci-kit](http://scikit-learn.org/stable/) learn packages for both feature extraction and classification model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:11:13.909398Z",
"start_time": "2018-11-16T08:11:09.586504Z"
},
"id": "KTNZXQRydhYh"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import warnings\n",
"import pyarrow.parquet as pq\n",
"from collections import defaultdict\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn import metrics"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2uBqqyugdhYi"
},
"source": [
"### Run the code from your browser"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Pd_9bOPidhYk"
},
"source": [
"## Importing and Cleaning Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "70TizekNdhYl"
},
"source": [
"The dataset we are using here consists of information regarding finance related complaints that a company has received from its customers.\n",
"It is provided by an open source data library managed by the U.S. General Services Administration and can be downloaded [here](https://catalog.data.gov/dataset/consumer-complaint-database).\n",
"<br><br>\n",
"Since we are trying to predict the category of products based on the complaints received, we can ignore the rest of the columns for the purposes of this exercise and only focus at 2 of them, in particular:\n",
"* Description - Narrative of customer's complaint\n",
"* Product - The category of financial products which the complaint relates to\n",
"<br><br>\n",
"\n",
"We will also get rid of null entries as they will not be of any use to us."
]
},
{
"cell_type": "code",
"source": [
"!wget https://files.consumerfinance.gov/ccdb/complaints.csv.zip"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zDiyE31Rd2Mb",
"outputId": "e17c19fe-3091-4222-a36e-2707580a5afb"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2023-10-23 10:21:55-- https://files.consumerfinance.gov/ccdb/complaints.csv.zip\n",
"Resolving files.consumerfinance.gov (files.consumerfinance.gov)... 13.35.116.98, 13.35.116.61, 13.35.116.20, ...\n",
"Connecting to files.consumerfinance.gov (files.consumerfinance.gov)|13.35.116.98|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 659978547 (629M) [binary/octet-stream]\n",
"Saving to: ‘complaints.csv.zip’\n",
"\n",
"complaints.csv.zip 100%[===================>] 629.40M 94.2MB/s in 6.7s \n",
"\n",
"2023-10-23 10:22:01 (94.5 MB/s) - ‘complaints.csv.zip’ saved [659978547/659978547]\n",
"\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"from zipfile import ZipFile\n",
"\n",
"with ZipFile('complaints.csv.zip', 'r') as zipObj:\n",
" # Extract all the contents of zip file in current directory\n",
" zipObj.extractall()"
],
"metadata": {
"id": "JHV5GYizd-pn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"## grabbing 1% of the data\n",
"import pandas as pd\n",
"import random\n",
"random.seed(10)\n",
"\n",
"p= 0.01\n",
"df = pd.read_csv(\"complaints.csv\",skiprows=lambda i: i>0 and random.random() > p)"
],
"metadata": {
"id": "WsUVPiDqeJnt"
},
"execution_count": null,
"outputs": []
},
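{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `skiprows` callable in the cell above always keeps row 0 (the header) and keeps every other row with probability `p`, so the sample is only approximately 1% of the file. The sketch below reproduces the idea on a small in-memory CSV. The toy data, the `sample_csv` helper and its parameters are assumptions made for illustration and are not part of the original notebook.\n",
"\n",
"```python\n",
"import io\n",
"import random\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for complaints.csv: 1,000 rows, 2 columns.\n",
"toy_csv = pd.DataFrame({'id': range(1000), 'value': range(1000)}).to_csv(index=False)\n",
"\n",
"\n",
"def sample_csv(csv_text, p, seed):\n",
"    # A private generator (instead of the global random.seed used in the notebook)\n",
"    # keeps the draw reproducible regardless of other code that uses random.\n",
"    rng = random.Random(seed)\n",
"    # Returning True skips the row: the header (i == 0) is never skipped,\n",
"    # and every other row is skipped with probability 1 - p.\n",
"    return pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i > 0 and rng.random() > p)\n",
"\n",
"\n",
"sample = sample_csv(toy_csv, p=0.1, seed=10)\n",
"print(len(sample), list(sample.columns))\n",
"```\n",
"\n",
"Because each row is drawn independently, the number of sampled rows varies with the seed. `DataFrame.sample(frac=...)` would return a fixed fraction instead, but only after the whole file has been loaded into memory."
]
},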
{
"cell_type": "code",
"source": [
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 712
},
"id": "twY8lccthrkq",
"outputId": "5d2bd255-b7fc-4eb8-89fb-2c89682bb070"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Date received Product \\\n",
"0 2023-10-06 Credit reporting or other personal consumer re... \n",
"1 2023-10-05 Credit reporting or other personal consumer re... \n",
"2 2023-10-19 Debt collection \n",
"3 2023-10-06 Checking or savings account \n",
"4 2023-10-05 Credit reporting or other personal consumer re... \n",
"\n",
" Sub-product Issue \\\n",
"0 Credit reporting Problem with a company's investigation into an... \n",
"1 Credit reporting Incorrect information on your report \n",
"2 Medical debt Attempts to collect debt not owed \n",
"3 Checking account Managing an account \n",
"4 Credit reporting Incorrect information on your report \n",
"\n",
" Sub-issue \\\n",
"0 Their investigation did not fix an error on yo... \n",
"1 Information belongs to someone else \n",
"2 Debt was result of identity theft \n",
"3 Banking errors \n",
"4 Information belongs to someone else \n",
"\n",
" Consumer complaint narrative \\\n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 XX/XX/2023 Wells Fargo own Transaction record ... \n",
"4 NaN \n",
"\n",
" Company public response \\\n",
"0 NaN \n",
"1 NaN \n",
"2 Company has responded to the consumer and the ... \n",
"3 Company has responded to the consumer and the ... \n",
"4 NaN \n",
"\n",
" Company State ZIP code Tags \\\n",
"0 LD Holdings Group, LLC CA 92563 NaN \n",
"1 Experian Information Solutions Inc. CT 06112 NaN \n",
"2 Ability Recovery Services, LLC TX 76210 NaN \n",
"3 WELLS FARGO & COMPANY NV 89084 NaN \n",
"4 Experian Information Solutions Inc. TX 75212 NaN \n",
"\n",
" Consumer consent provided? Submitted via Date sent to company \\\n",
"0 NaN Web 2023-10-06 \n",
"1 Other Web 2023-10-05 \n",
"2 Consent not provided Web 2023-10-19 \n",
"3 Consent provided Web 2023-10-06 \n",
"4 NaN Web 2023-10-05 \n",
"\n",
" Company response to consumer Timely response? Consumer disputed? \\\n",
"0 In progress Yes NaN \n",
"1 In progress Yes NaN \n",
"2 Closed with explanation Yes NaN \n",
"3 Closed with explanation Yes NaN \n",
"4 In progress Yes NaN \n",
"\n",
" Complaint ID \n",
"0 7658108 \n",
"1 7648067 \n",
"2 7728101 \n",
"3 7649539 \n",
"4 7652371 "
],
"text/html": [
"\n",
" <div id=\"df-242792ea-91a3-4d79-adee-8e0e991a242d\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date received</th>\n",
" <th>Product</th>\n",
" <th>Sub-product</th>\n",
" <th>Issue</th>\n",
" <th>Sub-issue</th>\n",
" <th>Consumer complaint narrative</th>\n",
" <th>Company public response</th>\n",
" <th>Company</th>\n",
" <th>State</th>\n",
" <th>ZIP code</th>\n",
" <th>Tags</th>\n",
" <th>Consumer consent provided?</th>\n",
" <th>Submitted via</th>\n",
" <th>Date sent to company</th>\n",
" <th>Company response to consumer</th>\n",
" <th>Timely response?</th>\n",
" <th>Consumer disputed?</th>\n",
" <th>Complaint ID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2023-10-06</td>\n",
" <td>Credit reporting or other personal consumer re...</td>\n",
" <td>Credit reporting</td>\n",
" <td>Problem with a company's investigation into an...</td>\n",
" <td>Their investigation did not fix an error on yo...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>LD Holdings Group, LLC</td>\n",
" <td>CA</td>\n",
" <td>92563</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Web</td>\n",
" <td>2023-10-06</td>\n",
" <td>In progress</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>7658108</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2023-10-05</td>\n",
" <td>Credit reporting or other personal consumer re...</td>\n",
" <td>Credit reporting</td>\n",
" <td>Incorrect information on your report</td>\n",
" <td>Information belongs to someone else</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Experian Information Solutions Inc.</td>\n",
" <td>CT</td>\n",
" <td>06112</td>\n",
" <td>NaN</td>\n",
" <td>Other</td>\n",
" <td>Web</td>\n",
" <td>2023-10-05</td>\n",
" <td>In progress</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>7648067</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2023-10-19</td>\n",
" <td>Debt collection</td>\n",
" <td>Medical debt</td>\n",
" <td>Attempts to collect debt not owed</td>\n",
" <td>Debt was result of identity theft</td>\n",
" <td>NaN</td>\n",
" <td>Company has responded to the consumer and the ...</td>\n",
" <td>Ability Recovery Services, LLC</td>\n",
" <td>TX</td>\n",
" <td>76210</td>\n",
" <td>NaN</td>\n",
" <td>Consent not provided</td>\n",
" <td>Web</td>\n",
" <td>2023-10-19</td>\n",
" <td>Closed with explanation</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>7728101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2023-10-06</td>\n",
" <td>Checking or savings account</td>\n",
" <td>Checking account</td>\n",
" <td>Managing an account</td>\n",
" <td>Banking errors</td>\n",
" <td>XX/XX/2023 Wells Fargo own Transaction record ...</td>\n",
" <td>Company has responded to the consumer and the ...</td>\n",
" <td>WELLS FARGO &amp; COMPANY</td>\n",
" <td>NV</td>\n",
" <td>89084</td>\n",
" <td>NaN</td>\n",
" <td>Consent provided</td>\n",
" <td>Web</td>\n",
" <td>2023-10-06</td>\n",
" <td>Closed with explanation</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>7649539</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2023-10-05</td>\n",
" <td>Credit reporting or other personal consumer re...</td>\n",
" <td>Credit reporting</td>\n",
" <td>Incorrect information on your report</td>\n",
" <td>Information belongs to someone else</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Experian Information Solutions Inc.</td>\n",
" <td>TX</td>\n",
" <td>75212</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Web</td>\n",
" <td>2023-10-05</td>\n",
" <td>In progress</td>\n",
" <td>Yes</td>\n",
" <td>NaN</td>\n",
" <td>7652371</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-242792ea-91a3-4d79-adee-8e0e991a242d')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-242792ea-91a3-4d79-adee-8e0e991a242d button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-242792ea-91a3-4d79-adee-8e0e991a242d');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-3283ac38-0bf8-4526-829a-c1a8f320c6ab\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-3283ac38-0bf8-4526-829a-c1a8f320c6ab')\"\n",
" title=\"Suggest charts.\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-3283ac38-0bf8-4526-829a-c1a8f320c6ab button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 5
}
]
},
{
"cell_type": "code",
"source": [
"df.columns"
],
"metadata": {
"id": "H-uMo7VPxQ2z",
"outputId": "f07e951c-2065-4dce-f928-ebe5eec6c508",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',\n",
" 'Consumer complaint narrative', 'Company public response', 'Company',\n",
" 'State', 'ZIP code', 'Tags', 'Consumer consent provided?',\n",
" 'Submitted via', 'Date sent to company', 'Company response to consumer',\n",
" 'Timely response?', 'Consumer disputed?', 'Complaint ID'],\n",
" dtype='object')"
]
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:14:46.980471Z",
"start_time": "2018-11-16T08:11:13.917382Z"
},
"scrolled": true,
"id": "UZXJ5t6BdhYl",
"outputId": "28841b96-4a01-4d42-efcd-7f1c559ed9bf",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 293
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" description \\\n",
"0 XX/XX/2023 Wells Fargo own Transaction record clearly showed that {$4500.00} disappeared from my checking account without any knowledge or approval. The missing money was paid to another person account under \" XXXX XXXX '' Wells Fargo blamed me f... \n",
"1 My credit card was lost, I noticed some unauthorized charges appeared on my statement. I notified the credit company regarding the unauthorized charges but they denied the claim. \n",
"2 I contacted XXXX XXXX at XXXX XXXX XXXX, XXXX XXXX, Kansas, to inquire about a test drive for a vehicle shown on their website. I informed the salesman that I was interested in a newer model but saw that their car was comparable to the one that I... \n",
"5 I checked my credit report and found that some of the data were incorrect. The three credit bureaus are required by Sections 609 ( a ) ( 1 ) ( A ) and 611 ( a ) ( 1 ) to verify these items ( A ). It is not permitted to report these items as unver... \n",
"7 I was driving my boyfriend car but the car is registered to me. I was a secondary driver on insurance policy. I was in a motor vehicle accident XX/XX/XXXX. I was hit by a driver and had no fault. The person the vehicle belonged too was not the dr... \n",
"\n",
" target \n",
"0 Checking or savings account \n",
"1 Credit card \n",
"2 Credit reporting or other personal consumer reports \n",
"5 Credit reporting or other personal consumer reports \n",
"7 Debt collection "
],
"text/html": [
"\n",
" <div id=\"df-ee9d3b45-2d4f-474a-abd7-fd1a71545406\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>description</th>\n",
" <th>target</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>XX/XX/2023 Wells Fargo own Transaction record clearly showed that {$4500.00} disappeared from my checking account without any knowledge or approval. The missing money was paid to another person account under \" XXXX XXXX '' Wells Fargo blamed me f...</td>\n",
" <td>Checking or savings account</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>My credit card was lost, I noticed some unauthorized charges appeared on my statement. I notified the credit company regarding the unauthorized charges but they denied the claim.</td>\n",
" <td>Credit card</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>I contacted XXXX XXXX at XXXX XXXX XXXX, XXXX XXXX, Kansas, to inquire about a test drive for a vehicle shown on their website. I informed the salesman that I was interested in a newer model but saw that their car was comparable to the one that I...</td>\n",
" <td>Credit reporting or other personal consumer reports</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>I checked my credit report and found that some of the data were incorrect. The three credit bureaus are required by Sections 609 ( a ) ( 1 ) ( A ) and 611 ( a ) ( 1 ) to verify these items ( A ). It is not permitted to report these items as unver...</td>\n",
" <td>Credit reporting or other personal consumer reports</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>I was driving my boyfriend car but the car is registered to me. I was a secondary driver on insurance policy. I was in a motor vehicle accident XX/XX/XXXX. I was hit by a driver and had no fault. The person the vehicle belonged too was not the dr...</td>\n",
" <td>Debt collection</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-ee9d3b45-2d4f-474a-abd7-fd1a71545406')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-ee9d3b45-2d4f-474a-abd7-fd1a71545406 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-ee9d3b45-2d4f-474a-abd7-fd1a71545406');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-339ca14d-ab06-41fa-abbb-ede1f0e19713\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-339ca14d-ab06-41fa-abbb-ede1f0e19713')\"\n",
" title=\"Suggest charts.\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-339ca14d-ab06-41fa-abbb-ede1f0e19713 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 7
}
],
"source": [
"df = df.loc[(df['Consumer complaint narrative'].notnull()), ['Consumer complaint narrative', 'Product']] \\\n",
" .reset_index() \\\n",
" .drop('index', axis = 1)\n",
"df = df[(np.logical_not(df.Product.str.contains(','))) & (df.Product != 'Credit card or prepaid card')]\n",
"df.columns = ['description', 'target']\n",
"pd.set_option('display.max_colwidth', 250)\n",
"df.head()"
]
},
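{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cleaning step is easier to verify on a handful of rows. The sketch below applies the same filters to a small, made-up frame; the rows and the `clean_complaints` helper are hypothetical and only mirror the logic of the cell above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up rows that mimic the two columns of interest.\n",
"raw = pd.DataFrame({\n",
"    'Consumer complaint narrative': ['Lost card', np.nan, 'Wrong balance', 'Late fee', 'Bad report'],\n",
"    'Product': ['Credit card', 'Mortgage', 'Payday loan, title loan, or personal loan',\n",
"                'Credit card or prepaid card', 'Credit reporting'],\n",
"})\n",
"\n",
"\n",
"def clean_complaints(frame):\n",
"    # Keep rows that have a narrative, and only the text and label columns.\n",
"    has_text = frame['Consumer complaint narrative'].notnull()\n",
"    out = frame.loc[has_text, ['Consumer complaint narrative', 'Product']].reset_index(drop=True)\n",
"    # Drop product names containing a comma, plus one category named explicitly.\n",
"    keep = ~out['Product'].str.contains(',') & (out['Product'] != 'Credit card or prepaid card')\n",
"    out = out[keep]\n",
"    out.columns = ['description', 'target']\n",
"    return out\n",
"\n",
"\n",
"cleaned = clean_complaints(raw)\n",
"print(cleaned)\n",
"```\n",
"\n",
"The index is reset before the product filter is applied, so the filtered frame keeps gaps in its index. This is why the output above jumps from index 2 to 5."
]
},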
{
"cell_type": "markdown",
"metadata": {
"id": "6dNSJFn9dhYn"
},
"source": [
"Next, we will assign each target variable an integer value instead of using the string representation (Credit reporting for example). This allows our models to be able to recognize the responses. We can do this using a variety of methods, but here we will be using sklearn's LabelEncoder function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:14:47.932973Z",
"start_time": "2018-11-16T08:14:46.984508Z"
},
"id": "h76oqXmPdhYn",
"outputId": "fc064173-7d18-4f92-eacc-8fbd8c9316df",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 551
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" encoded_response\n",
"target \n",
"Bank account or service 0\n",
"Checking or savings account 1\n",
"Consumer Loan 2\n",
"Credit card 3\n",
"Credit reporting 4\n",
"Credit reporting or other personal consumer reports 5\n",
"Debt collection 6\n",
"Debt or credit management 7\n",
"Money transfers 8\n",
"Mortgage 9\n",
"Other financial service 10\n",
"Payday loan 11\n",
"Prepaid card 12\n",
"Student loan 13\n",
"Vehicle loan or lease 14"
],
"text/html": [
"\n",
" <div id=\"df-75471db1-57b4-439b-b108-d55e239fb77b\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>encoded_response</th>\n",
" </tr>\n",
" <tr>\n",
" <th>target</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bank account or service</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Checking or savings account</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Consumer Loan</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Credit card</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Credit reporting</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Credit reporting or other personal consumer reports</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Debt collection</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Debt or credit management</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Money transfers</th>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mortgage</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Other financial service</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Payday loan</th>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Prepaid card</th>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Student loan</th>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vehicle loan or lease</th>\n",
" <td>14</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 8
}
],
"source": [
"encoder = LabelEncoder()\n",
"encoder.fit(df.target)\n",
"df = df.assign(encoded_response = lambda x: encoder.transform(x.target))\n",
"df[['target', 'encoded_response']].drop_duplicates() \\\n",
" .set_index('target') \\\n",
" .sort_values('encoded_response', ascending = True)"
]
},
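{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitted encoder also stores the label-to-integer mapping, so integer predictions can be translated back into product names later. The cell below is a minimal sketch on a few toy labels; `toy_labels` and `toy_encoder` are made up for illustration and are not part of the complaints dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Hypothetical toy labels, used only to illustrate the encoder\n",
"toy_labels = ['Mortgage', 'Credit card', 'Mortgage', 'Student loan']\n",
"\n",
"toy_encoder = LabelEncoder()\n",
"toy_codes = toy_encoder.fit_transform(toy_labels)\n",
"\n",
"# classes_ holds the sorted unique labels; a label's position is its integer code\n",
"print(dict(zip(toy_encoder.classes_, range(len(toy_encoder.classes_)))))\n",
"\n",
"# inverse_transform maps integer codes back to the original strings\n",
"print(toy_encoder.inverse_transform(toy_codes))"
]
},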
{
"cell_type": "markdown",
"metadata": {
"id": "i2_iNaGGdhYo"
},
"source": [
"We can see from the output above that our resulting dataset contains 15 unique categories (encoded 0 to 14), into which we will try to classify complaints by training our model to \"understand\" the narrative. <br><br>\n",
"After encoding our target variable, we can move on to splitting our dataset for training and validation purposes. A train/test split is a crucial part of any modelling process: evaluating on data the model has never seen is how we detect [overfitting](https://en.wikipedia.org/wiki/Overfitting). Here, we use an 80/20 split and set a random seed for reproducibility."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:14:47.999483Z",
"start_time": "2018-11-16T08:14:47.940983Z"
},
"id": "TrrbsPFfdhYo"
},
"outputs": [],
"source": [
"x_train, x_test, y_train, y_test, indices_train, indices_test = train_test_split(df.description,\n",
" df.encoded_response,\n",
" df.index,\n",
" test_size = 0.2,\n",
" random_state = 1)"
]
},
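{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `test_size` and `random_state` do, the cell below is a small sketch on made-up data (`toy_x` and `toy_y` are hypothetical and unrelated to the complaints dataset). It splits the same ten samples twice with the same seed and checks that both calls return the same test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical toy data: 10 samples with alternating integer labels\n",
"toy_x = np.arange(10)\n",
"toy_y = np.array([0, 1] * 5)\n",
"\n",
"# Two calls with the same random_state produce the same 80/20 split\n",
"a_train, a_test, a_y_train, a_y_test = train_test_split(toy_x, toy_y, test_size = 0.2, random_state = 1)\n",
"b_train, b_test, b_y_train, b_y_test = train_test_split(toy_x, toy_y, test_size = 0.2, random_state = 1)\n",
"\n",
"# Sizes of the training and test portions\n",
"print(len(a_train), len(a_test))\n",
"# True when both calls selected the same test samples\n",
"print(np.array_equal(a_test, b_test))"
]
},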
{
"cell_type": "markdown",
"metadata": {
"id": "N2RfY3ErdhYo"
},
"source": [
"## Feature Extraction"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "j4RahTLqdhYp"
},
"source": [
"Just as we have encoded our target variable, we must also find a way to transform our description data out of its string representation into a numerical one. However, unlike the target variable, this is not as simple as allocating an integer value to each unique complaint. Keep in mind that a meaningful transformation must make complaints within the same product category look similar to one another."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "37wbNimvdhYp"
},
"source": [
"### Bag of Words (BOW) Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gZ2T84AldhYp"
},
"source": [
"One way of transforming a document full of text into numerical features is the BOW (Bag of Words) model. Put simply, it assigns each unique word (or token) an ID and counts how often it occurs. For example, if we have a document: <br><br>\n",
"`\"This is a cat. That is a dog\"`\n",
"<br><br>\n",
"The BOW representation would simply be: <br><br>\n",
"`BOW = {\"This\": 1, \"is\": 2, \"a\": 2, \"cat\" : 1, \"That\" : 1, \"dog\" : 1}`\n",
"<br><br>\n",
"Notice that the document shown above is clearly about a cat and a dog. However, our BOW model (also called a count vectorizer) shows that the most frequent words in the document are \"is\" and \"a\". These common words are known as [\"stop words\"](https://en.wikipedia.org/wiki/Stop_words) and are usually filtered out during the pre-processing phase so that they do not drown out the words that carry actual meaning. There are many little tricks and techniques for choosing the most suitable bag of words to represent your documents, and most can be implemented in a line (or two) of code using [regular expressions](https://docs.python.org/2/library/re.html), which are a great way to filter text; I highly recommend getting comfortable with them.<br><br>\n",
"How does this link back to the goal of having numerical features as our inputs? Imagine having a thousand text documents and extracting all (or a subset) of the unique tokens that we think best represent them. Now, we can assign each unique token to a column in a dataframe and populate each row with one document’s counts of the respective words. Following the example above, our single document would produce a dataframe that looks like: <br><br>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:14:48.026977Z",
"start_time": "2018-11-16T08:14:48.010973Z"
},
"id": "xZ4rtFb1dhYp",
"outputId": "b3dd4a54-1b83-47f1-8f23-391c8b1d11f1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" This is a cat That dog\n",
"0 1 2 2 1 1 1"
],
"text/html": [
"\n",
" <div id=\"df-68c2e618-65d3-43ff-990d-df5f0c17665c\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>This</th>\n",
" <th>is</th>\n",
" <th>a</th>\n",
" <th>cat</th>\n",
" <th>That</th>\n",
" <th>dog</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 10
}
],
"source": [
"pd.DataFrame({\"This\": 1, \"is\": 2, \"a\": 2, \"cat\" : 1,\n",
" \"That\" : 1, \"dog\" : 1}, index = range(1))"
]
},
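{
"cell_type": "markdown",
"metadata": {},
"source": [
"The stop-word filtering described above can be sketched in a few lines. The cell below assumes a tiny hand-picked stop-word list (`toy_stop_words`, made up for this example; real projects usually rely on a curated list) and uses a simple regular expression to pull out the words."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from collections import Counter\n",
"\n",
"toy_text = 'This is a cat. That is a dog'\n",
"\n",
"# Hypothetical, deliberately tiny stop-word list for illustration only\n",
"toy_stop_words = {'this', 'that', 'is', 'a'}\n",
"\n",
"# \\w+ matches runs of letters, digits and underscores, so punctuation is dropped\n",
"toy_tokens = re.findall(r'\\w+', toy_text.lower())\n",
"\n",
"# Count only the tokens that are not stop words\n",
"Counter(token for token in toy_tokens if token not in toy_stop_words)"
]
},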
{
"cell_type": "markdown",
"source": [
"We can of course extend this to multiple sentences to build up a dataset. A second sentence could be: `\"This dog acts like a human\"`"
],
"metadata": {
"id": "cTmNRtSZyb0H"
}
},
{
"cell_type": "code",
"source": [
"pd.DataFrame([{\"This\": 1, \"is\": 2, \"a\": 2, \"cat\" : 1,\n",
" \"That\" : 1, \"dog\" : 1, \"acts\":0,\"like\":0, \"human\":0},\n",
" {\"This\": 1, \"is\": 0, \"a\": 1, \"cat\" : 0,\n",
" \"That\" : 0, \"dog\" : 1, \"acts\":1,\"like\":1, \"human\":1}], index = range(2))"
],
"metadata": {
"id": "8Ezxw62Wyouw",
"outputId": "8ed6a136-1136-410c-cb82-ff358426ef7f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
}
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" This is a cat That dog acts like human\n",
"0 1 2 2 1 1 1 0 0 0\n",
"1 1 0 1 0 0 1 1 1 1"
],
"text/html": [
"\n",
" <div id=\"df-80833b55-9c52-470b-a5fe-9d89fff1c26b\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>This</th>\n",
" <th>is</th>\n",
" <th>a</th>\n",
" <th>cat</th>\n",
" <th>That</th>\n",
" <th>dog</th>\n",
" <th>acts</th>\n",
" <th>like</th>\n",
" <th>human</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 11
}
]
},
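{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than typing the counts by hand, a count vectorizer can build this table directly. The cell below is a sketch using scikit-learn's `CountVectorizer` (it assumes scikit-learn 1.0 or newer for `get_feature_names_out`). The `lowercase = False` and `token_pattern` settings are choices made here so that the output keeps the capitalised words and the one-letter word \"a\" from the hand-built example; the defaults would lowercase everything and drop one-letter tokens. Note that the columns come out in sorted order rather than in order of appearance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"toy_docs = ['This is a cat. That is a dog', 'This dog acts like a human']\n",
"\n",
"# Keep case and one-letter tokens so the counts match the hand-built table\n",
"toy_vectorizer = CountVectorizer(lowercase = False, token_pattern = r'(?u)\\b\\w+\\b')\n",
"toy_counts = toy_vectorizer.fit_transform(toy_docs)\n",
"\n",
"# Rows are documents, columns are the vocabulary learned from all documents\n",
"pd.DataFrame(toy_counts.toarray(), columns = toy_vectorizer.get_feature_names_out())"
]
},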
{
"cell_type": "markdown",
"source": [
"Finally, we want to predict what each sentence is about. The word-count columns become the features, and the last column holds the label we want to predict."
],
"metadata": {
"id": "fX3bBIP6zhZt"
}
},
{
"cell_type": "code",
"source": [
"pd.DataFrame([{\"This\": 1, \"is\": 2, \"a\": 2, \"cat\" : 1,\n",
" \"That\" : 1, \"dog\" : 1, \"acts\":0,\"like\":0, \"human\":0, \"PREDICT CAT-DOG 0/1\": 0},\n",
" {\"This\": 1, \"is\": 0, \"a\": 1, \"cat\" : 0,\n",
" \"That\" : 0, \"dog\" : 1, \"acts\":1,\"like\":1, \"human\":1, \"PREDICT CAT-DOG 0/1\": 1}], index = range(2))"
],
"metadata": {
"id": "3FLAZopN0Mlt",
"outputId": "57ac9669-bea3-4ea5-c609-b45abb443207",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
}
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" This is a cat That dog acts like human PREDICT CAT-DOG 0/1\n",
"0 1 2 2 1 1 1 0 0 0 0\n",
"1 1 0 1 0 0 1 1 1 1 1"
],
"text/html": [
"\n",
" <div id=\"df-0937b0a7-3760-4670-bcf6-ae49e6ad093c\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>This</th>\n",
" <th>is</th>\n",
" <th>a</th>\n",
" <th>cat</th>\n",
" <th>That</th>\n",
" <th>dog</th>\n",
" <th>acts</th>\n",
" <th>like</th>\n",
" <th>human</th>\n",
" <th>PREDICT CAT-DOG 0/1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0937b0a7-3760-4670-bcf6-ae49e6ad093c')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-0937b0a7-3760-4670-bcf6-ae49e6ad093c button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-0937b0a7-3760-4670-bcf6-ae49e6ad093c');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-acb4e7e7-7d2b-4dfa-88d8-40db76a9a175\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-acb4e7e7-7d2b-4dfa-88d8-40db76a9a175')\"\n",
" title=\"Suggest charts.\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-acb4e7e7-7d2b-4dfa-88d8-40db76a9a175 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z4tyPK7ldhYq"
},
"source": [
"Realistically, if we had chosen 15,000 unique tokens from the 1000 documents, our input matrix would then have 15,000 columns! You can start to imagine how sparse (full of zeros) our resulting input matrix will be! This technique is also commonly known as count vectorizing. <br><br>\n",
"In this article however, we will be using a more robust model than count vectorizing called (TF-IDF) Term Frequency - Inverse Document Frequency, and it is defined as:\n",
"\n",
"$w_{i,j} = tf_{i,j} * log(\\dfrac{N}{df_i})$\n",
"<br><br>\n",
"$w_{i,j}$ = Weight for word(i) in document(j) <br>\n",
"$tf_{i,j}$ = Count of word(i) in document(j) <br>\n",
"$N$ = Total number of documents <br>\n",
"$df_i$ = How many documents word(i) appears in<br><br>\n",
"\n",
"We can see that, the first term of the TF-IDF model is just the count of the word in the document as before. The magic happens in the second term where the model imposes an additional condition for a word to be deemed \"important\". Just as an example, if the word \"bank\" appears in every single document, it wouldn't be of much use in differentiating the documents, and the second term of the TF-IDF model expresses this by reducing the whole weight down to 0."
]
},
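{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small worked example may help make the formula above concrete. The numbers are purely illustrative and are not taken from the complaints dataset, and the logarithm is taken to base 10 only to keep the arithmetic clean (the formula leaves the base unspecified). Suppose there are $N = 1000$ documents, and the word \"overdraft\" appears $tf_{i,j} = 3$ times in document $j$ and occurs in $df_i = 10$ documents overall:\n",
"\n",
"$w_{i,j} = 3 \\times \\log_{10}\\left(\\dfrac{1000}{10}\\right) = 3 \\times 2 = 6$\n",
"<br><br>\n",
"If instead the word \"bank\" also appears 3 times in document $j$ but occurs in all 1000 documents:\n",
"\n",
"$w_{i,j} = 3 \\times \\log_{10}\\left(\\dfrac{1000}{1000}\\right) = 3 \\times 0 = 0$\n",
"<br><br>\n",
"Note that scikit-learn's `TfidfVectorizer` uses, by default, a smoothed variant of the second term, $\\ln\\left(\\dfrac{1 + N}{1 + df_i}\\right) + 1$, so the weights it produces will differ from this textbook calculation and never reach exactly 0."
]
},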
{
"cell_type": "markdown",
"metadata": {
"id": "FqruSCMBdhYq"
},
"source": [
"There are a variety of packages that help to automate the vectorizing process, but we have chosen to use the Sci-kit learn API due to its easy of usage and interpretability. Here, we instantiate a TfidfVectorizer model, with a few note-worthy details: <br>\n",
"* Sublinear_tf uses a logarithmic form of frequency as 20 occurrences of a word does not imply 20 times the importance in most cases <br>\n",
"* The model will ignore words that appear in less than 5 documents, as well as more than 70% of the total documents <br>\n",
"* Normalization is set to L2 (Not to be confused with L2 regularisation) so that all vectors are scaled to have a magnitude of 1 (The equivalent of standardizing your features) <br>\n",
"* There is a layer of pre-processing to remove numerical characters and symbols within the documents using a regular expression (REGEX)"
]
},
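{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two of the settings above are easier to see with numbers. The figures below are made up for illustration and do not come from the dataset. <br><br>\n",
"Sublinear scaling replaces a raw count $tf$ with $1 + \\ln(tf)$. A word that appears once scores $1 + \\ln(1) = 1$, whereas a word that appears 20 times scores\n",
"\n",
"$1 + \\ln(20) \\approx 1 + 3.0 = 4.0$\n",
"<br><br>\n",
"that is, about 4 times the weight rather than 20 times. <br><br>\n",
"L2 normalization divides each document vector by its Euclidean length. A hypothetical document with just two non-zero weights, $(3, 4)$, has length $\\sqrt{3^2 + 4^2} = 5$ and becomes\n",
"\n",
"$\\left(\\dfrac{3}{5}, \\dfrac{4}{5}\\right) = (0.6, 0.8)$, with $0.6^2 + 0.8^2 = 0.36 + 0.64 = 1$"
]
},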
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:14:48.039968Z",
"start_time": "2018-11-16T08:14:48.032969Z"
},
"id": "3s82QYuVdhYq"
},
"outputs": [],
"source": [
"tfidf_vectorizer = TfidfVectorizer(sublinear_tf = True,\n",
" stop_words = 'english',\n",
" min_df = 5,\n",
" max_df = 0.7,\n",
" norm = 'l2',\n",
" token_pattern = r'\\b[^_\\d\\W]+\\b')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zX7OoSKNdhYr"
},
"source": [
"We then use the model to fit our original description data, and convert all that sweet text information into something numerical!"
]
},
{
"cell_type": "code",
"source": [
"x_train.shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "WANhNNjygqKU",
"outputId": "49d84113-6063-4c11-af81-576fa310e74d"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(4469,)"
]
},
"metadata": {},
"execution_count": 14
}
]
},
{
"cell_type": "code",
"source": [
"x_train"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EaTAu472374e",
"outputId": "ea5eda11-e6fa-4853-88b6-61d38b6a5401"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"14714 I did not authorize the following inquiries : Transunion- XXXX XXXXXXXX-XX/XX/2017 XXXX XXXX XX/XX/2017 XXXX- XXXX XXXX XX/XX/2017XX/XX/2017\n",
"5004 I am banking with Capital one bank for about 5 years. I have checking as well as savings account with them. On XXXX XXXX a person to person money transfer was made from my account of {$600.00} to a person named XXXX XXXX by XXXX money transfer co...\n",
"5734 On XX/XX/2021 I called Navient to get some information about the co-signer release program on my loan. At one point in the conversation, I was told incorrectly that my loans were in an interest-only payment status, which was news to me and I have...\n",
"1119 I reviewed my Consumer Reports and noticed that I had One late payment on an account that I was never late for. Consumer Reporting Agencies have assumed a vital role and have a responsibility to report Consumer information to the best of their ab...\n",
"8926 There is a debt on my credit report that does not belong to me, the credit report had removed it as they can not validate the debt, but to my surprise two weeks ago they added the debt back to my account without given me a written warning nothing...\n",
" ... \n",
"3230 CFPB, I have been the victim of Identity Theft, someone opened a payday loan account with Check N Go with my information. I have escalated this to FTD and the DOJ. The requirement of the Bank to respond to credit bureau disputes is outlined by th...\n",
"14181 I am billed from collection agency for bill on XX/XX/16 paid by my insurance companies which was primary XXXX paid XXXX, the secondary insurance XXXX would covered XXXX any balances should have been billed to XXXX as a secondary insurance in time...\n",
"11181 Attempting to collect a debt I don't know. Tried to get resolved previously.\n",
"776 On XX/XX/XXXX I applied for a VA home loan through USAA. This would be the second time I would have a VA loan. My first home was through a VA loan. I receive a pre-approval on the same day and immediately started submitting my required paperwork....\n",
"14008 On XX/XX/XXXX XXXX XXXX XXXX XXXX XXXX XXXX charged me for liability insurance ( {$270.00} ) on a HELOC ( acct # XXXX originated by BB & T bank XXXX now Truist ) that has had a XXXX balance since it was paid in full in XX/XX/XXXX. This is on a pr...\n",
"Name: description, Length: 4469, dtype: object"
]
},
"metadata": {},
"execution_count": 15
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:15:46.665241Z",
"start_time": "2018-11-16T08:14:48.044968Z"
},
"id": "ASdzl-IudhYr"
},
"outputs": [],
"source": [
"tfidf_train = tfidf_vectorizer.fit_transform(x_train.values)\n",
"tfidf_test = tfidf_vectorizer.transform(x_test.values)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:15:46.711749Z",
"start_time": "2018-11-16T08:15:46.669249Z"
},
"id": "xWrLGFLedhYs",
"outputId": "4300d68f-16e5-44aa-fc23-d8fd6c96c947",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"4500\n"
]
}
],
"source": [
"print(len(tfidf_vectorizer.get_feature_names_out()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3vGvUd0pdhYs"
},
"source": [
"We can see that our resulting dictionary consists of 4249 different unique words!"
]
},
{
"cell_type": "code",
"source": [
"tfidf_train"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ugERkn7cixuB",
"outputId": "49f1e416-037a-4c27-f9ff-5c2d8aad9b53"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<4469x4500 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 242740 stored elements in Compressed Sparse Row format>"
]
},
"metadata": {},
"execution_count": 18
}
]
},
{
"cell_type": "code",
"source": [
"tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_train,\n",
" columns = tfidf_vectorizer.get_feature_names_out())"
],
"metadata": {
"id": "OGCVn85_jUwy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:16:17.160179Z",
"start_time": "2018-11-16T08:16:01.463243Z"
},
"id": "xuh-T7WWdhYt",
"outputId": "9b852578-7f08-4fcd-e2f7-069d75f612cb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" abandoned abide ability able abruptly ... yrs z zero zip \\\n",
"4464 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 \n",
"4465 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 \n",
"4466 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 \n",
"4467 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 \n",
"4468 0.0 0.0 0.0 0.080909 0.0 ... 0.0 0.0 0.0 0.0 \n",
"\n",
" zombie \n",
"4464 0.0 \n",
"4465 0.0 \n",
"4466 0.0 \n",
"4467 0.0 \n",
"4468 0.0 \n",
"\n",
"[5 rows x 4500 columns]"
],
"text/html": [
"\n",
" <div id=\"df-d01f4729-c5a8-4511-a5d9-230e7b3c8b52\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>abandoned</th>\n",
" <th>abide</th>\n",
" <th>ability</th>\n",
" <th>able</th>\n",
" <th>abruptly</th>\n",
" <th>...</th>\n",
" <th>yrs</th>\n",
" <th>z</th>\n",
" <th>zero</th>\n",
" <th>zip</th>\n",
" <th>zombie</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4464</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4465</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4466</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4467</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4468</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.080909</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 4500 columns</p>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d01f4729-c5a8-4511-a5d9-230e7b3c8b52')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-d01f4729-c5a8-4511-a5d9-230e7b3c8b52 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-d01f4729-c5a8-4511-a5d9-230e7b3c8b52');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-42ee2788-2803-4cc0-8dd0-d3944fe15455\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-42ee2788-2803-4cc0-8dd0-d3944fe15455')\"\n",
" title=\"Suggest charts.\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-42ee2788-2803-4cc0-8dd0-d3944fe15455 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 20
}
],
"source": [
"pd.options.display.max_columns = 10\n",
"tfidf_df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_aYorEkTdhYu"
},
"source": [
"As mentioned above, the 4000+ words chosen to be in our \"dictionary\" become the columns of our input matrix and just to reiterate, since we have so many columns and each row only consists of words within 1 document, the resulting matrix will be extremely sparse! (Consisting mostly of zeros) <br><br>\n",
"Now that we have extracted some numerical features from our dataset, it is time to use them to train a classifier!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BPofAO55dhYu"
},
"source": [
"## Logistic Regression Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0y8vjXUodhYu"
},
"source": [
"In this example, we have chosen to use a basic [logistic regression model](http://www.appstate.edu/~whiteheadjc/service/logit/intro.htm) to classify the documents due to tractability and convention. However, more complex models such as neural networks, decision trees, naive bayesian classifiers or other relevant models can also be used to do the following classification and the reader is encouraged to try different machine learning models to see which ones give the most accurate results! <br><br>\n",
"Firstly, we instantiate a logistic regression classifier from the sklearn package and proceed to fit our TF-IDF vectorized matrix and our encoded training labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:17:48.728778Z",
"start_time": "2018-11-16T08:16:17.165234Z"
},
"id": "GrEhMG42dhYu",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 213
},
"outputId": "e8b54664-acb7-4734-bb3a-663f350b1ae5"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"LogisticRegression()"
]
},
"metadata": {},
"execution_count": 21
}
],
"source": [
"lr_classifier = LogisticRegression()\n",
"lr_classifier.fit(tfidf_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a8phPYCJdhYu"
},
"source": [
"We can observe that amongst the model parameters, sklearn has automatically set 'penalty' to L2. This is a form of regularization, which weighs the benefits of a better fit against using more features by adding an extra term to the objective function. Recall that when using the BOW model, we ended up creating thousands of features. Adding regularization to our model would then help to prevent using too many of them!"
]
},
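{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols, one common way to write an L2-penalized objective is sketched below. The notation is introduced here for illustration only: $w$ stands for the vector of coefficients, $\\ell_i(w)$ for the log loss on training document $i$, and $n$ for the number of training documents. Scaling conventions vary between references.\n",
"\n",
"$\\min_{w} \\; C \\sum_{i=1}^{n} \\ell_i(w) + \\dfrac{1}{2} \\lVert w \\rVert_2^2$\n",
"<br><br>\n",
"Here $C$ is the inverse regularization strength (scikit-learn's default is `C = 1.0`): a smaller $C$ puts relatively more weight on the penalty and shrinks the coefficients harder. As illustrative arithmetic, a single coefficient of $3.0$ contributes $\\tfrac{1}{2} \\times 3.0^2 = 4.5$ to the penalty, whereas a coefficient of $0.3$ contributes only $\\tfrac{1}{2} \\times 0.3^2 = 0.045$, so large coefficients are penalized disproportionately. Note that an L2 penalty shrinks coefficients toward zero but rarely sets them exactly to zero."
]
},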
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:24:09.833368Z",
"start_time": "2018-11-16T08:24:08.330375Z"
},
"scrolled": true,
"id": "hdcPh3PQdhYv",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"outputId": "5e770b2c-e04a-435f-f402-b62707c3fb2c"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" features coefficients\n",
"393 bank 3.810558\n",
"2806 overdraft 1.935661\n",
"1592 fees 1.887826\n",
"1728 funds 1.557623\n",
"2589 money 1.518259\n",
"1020 debit 1.501463\n",
"32 account 1.491159\n",
"2620 n 1.410070\n",
"1586 fee 1.409157\n",
"649 checking 1.365648"
],
"text/html": [
"\n",
" <div id=\"df-372219ff-da30-47c8-a227-32c867b9ebc6\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>features</th>\n",
" <th>coefficients</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>393</th>\n",
" <td>bank</td>\n",
" <td>3.810558</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2806</th>\n",
" <td>overdraft</td>\n",
" <td>1.935661</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1592</th>\n",
" <td>fees</td>\n",
" <td>1.887826</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1728</th>\n",
" <td>funds</td>\n",
" <td>1.557623</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2589</th>\n",
" <td>money</td>\n",
" <td>1.518259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1020</th>\n",
" <td>debit</td>\n",
" <td>1.501463</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>account</td>\n",
" <td>1.491159</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2620</th>\n",
" <td>n</td>\n",
" <td>1.410070</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1586</th>\n",
" <td>fee</td>\n",
" <td>1.409157</td>\n",
" </tr>\n",
" <tr>\n",
" <th>649</th>\n",
" <td>checking</td>\n",
" <td>1.365648</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 22
}
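],
"source": [
"# Pair every TF-IDF feature with its learned weight.\n",
"# Note: coef_[0] is the first row of the weight matrix (one row per class in the multiclass case).\n",
"coefficients = pd.DataFrame(dict(zip(tfidf_df.columns,lr_classifier.coef_[0])),index = [0]).T.reset_index()\n",
"coefficients.columns = ['features', 'coefficients']\n",
"# Show the ten features with the largest positive weights.\n",
"coefficients.sort_values(\"coefficients\",ascending=False).head(10)"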
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0yUbmmcXdhYv"
},
"source": [
"Since we have converted all our \"text\" data into numerical features, we can view the features and their weights just as we would for any other logistic regression model on numerical data."
]
},
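{
"cell_type": "markdown",
"metadata": {},
"source": [
"With more than two classes, scikit-learn's `LogisticRegression` stores one row of weights per class in `coef_`, so `coef_[0]` covers only the first encoded class. Below is a minimal sketch of reading the strongest features class by class, using a small made-up weight matrix (the feature names, class names and values are illustrative assumptions, not output from this notebook):\n",
"\n",
"```python\n",
"# Hypothetical weights: rows are classes, columns are features (values are made up).\n",
"feature_names = ['mortgage', 'escrow', 'card', 'interest', 'collector']\n",
"class_names = ['Credit card', 'Debt collection', 'Mortgage']\n",
"weights = [\n",
"    [-0.8, -0.5, 2.1, 0.9, -0.3],   # Credit card\n",
"    [-0.6, -0.4, -0.2, 0.1, 2.4],   # Debt collection\n",
"    [2.3, 1.7, -0.9, 0.4, -0.7],    # Mortgage\n",
"]\n",
"\n",
"def top_features(row, names, k=2):\n",
"    # Pair each weight with its feature name and keep the k largest weights.\n",
"    ranked = sorted(zip(row, names), reverse=True)\n",
"    return [name for _, name in ranked[:k]]\n",
"\n",
"top_by_class = {\n",
"    cls: top_features(row, feature_names) for cls, row in zip(class_names, weights)\n",
"}\n",
"for cls, feats in top_by_class.items():\n",
"    print(cls, '->', feats)\n",
"```\n",
"\n",
"On the fitted model, the same ranking could be applied to each row of `lr_classifier.coef_`, paired with `tfidf_df.columns` and `encoder.classes_`."
]
},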
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:17:50.395323Z",
"start_time": "2018-11-16T08:17:50.316787Z"
},
"id": "sp-D4SBydhYv",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "421c7c15-2ddc-479a-cfe2-ab68aa8d9a57"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"0.817531305903399\n"
]
}
],
"source": [
"pred = lr_classifier.predict(tfidf_test)\n",
"score = metrics.accuracy_score(y_test, pred)\n",
"print(score)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r7ZVpEQSdhYv"
},
"source": [
"After running the model on the held-out test set, we can see that we attained decent results, with an overall accuracy of ~82% from just a simple logistic regression classifier. We can drill down into this result by evaluating the [confusion matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology)."
]
},
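{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overall accuracy is easier to judge next to a trivial baseline, because on imbalanced data a model that always predicts the largest category already scores well. Here is a minimal sketch with made-up labels (the label lists are illustrative assumptions, not the notebook's data):\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def accuracy(actual, predicted):\n",
"    # Fraction of positions where the prediction matches the label.\n",
"    hits = sum(a == p for a, p in zip(actual, predicted))\n",
"    return hits / len(actual)\n",
"\n",
"# Made-up labels: class 0 dominates, as in an imbalanced complaints dataset.\n",
"actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]\n",
"predicted = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]\n",
"\n",
"# Baseline: always predict the most frequent class.\n",
"majority_class = Counter(actual).most_common(1)[0][0]\n",
"baseline = [majority_class] * len(actual)\n",
"\n",
"model_acc = accuracy(actual, predicted)\n",
"baseline_acc = accuracy(actual, baseline)\n",
"print('model accuracy:   ', model_acc)\n",
"print('baseline accuracy:', baseline_acc)\n",
"```\n",
"\n",
"The gap between the two numbers, rather than the model's accuracy alone, indicates how much the classifier has actually learned."
]
},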
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:17:50.488780Z",
"start_time": "2018-11-16T08:17:50.401773Z"
},
"id": "1aqtlmmedhYv"
},
"outputs": [],
"source": [
"cm = metrics.confusion_matrix(y_test, pred)\n",
"# Row-normalize: divide each row by the number of actual samples in that category.\n",
"cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C-UnxLHxdhYw"
},
"source": [
"Due to the huge imbalance of data points between the different categories, each row of the confusion matrix is normalized by the number of actual samples in that category, so every cell shows a proportion rather than a raw count. This produces clearer distinctions between “good” and “poor” performance."
]
},
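{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row normalization used above can be sketched in plain Python as follows. The labels are made up for illustration; the point is that after dividing each row by its total, the diagonal holds the fraction of each actual category that was predicted correctly (its recall), regardless of how many samples the category has.\n",
"\n",
"```python\n",
"def confusion_matrix(actual, predicted, n_classes):\n",
"    # cm[i][j] counts samples whose true class is i and predicted class is j.\n",
"    cm = [[0] * n_classes for _ in range(n_classes)]\n",
"    for a, p in zip(actual, predicted):\n",
"        cm[a][p] += 1\n",
"    return cm\n",
"\n",
"def normalize_rows(cm):\n",
"    # Divide each row by its total so it sums to 1 (empty rows stay at 0).\n",
"    out = []\n",
"    for row in cm:\n",
"        total = sum(row)\n",
"        out.append([c / total if total else 0.0 for c in row])\n",
"    return out\n",
"\n",
"# Made-up labels: eight samples of class 0 and only two of class 1.\n",
"actual = [0] * 8 + [1] * 2\n",
"predicted = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]\n",
"cm = confusion_matrix(actual, predicted, n_classes=2)\n",
"cm_norm = normalize_rows(cm)\n",
"print(cm)       # raw counts\n",
"print(cm_norm)  # per-class fractions\n",
"```\n",
"\n",
"In the raw counts class 1 looks negligible, while the normalized row shows that only half of its samples were recovered."
]
},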
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:17:52.986277Z",
"start_time": "2018-11-16T08:17:50.494831Z"
},
"scrolled": false,
"id": "RzwooLNwdhYw",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"outputId": "56d16d49-7981-4d1a-f249-e394f3b82a23"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Accuracy Score: 0.817531305903399')"
]
},
"metadata": {},
"execution_count": 25
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1500x1500 with 2 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"plt.figure(figsize = (15,15))\n",
"sns.heatmap(cm,\n",
" annot = True,\n",
" fmt = \".3f\",\n",
" linewidths = 0.5,\n",
" square = True,\n",
" cmap = 'Reds',\n",
" xticklabels = encoder.classes_,\n",
" yticklabels = encoder.classes_)\n",
"plt.ylabel('Actual')\n",
"plt.xlabel('Predicted')\n",
"all_sample_title = 'Accuracy Score: {}'.format(score)\n",
"plt.title(all_sample_title, size = 15)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LP-8enUbdhYw"
},
"source": [
"One of the more obvious results we can observe just from the diagonal of the matrix is that many categories, like 'consumer loan' and 'Bank account or service', have almost no correct predictions. We must note, however, that the number of data points we had for each of these categories was insignificant compared to the others, so the model probably treated those instances as errors instead of as distinct categories. This bias can be corrected by techniques such as [over and undersampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis). Let's investigate further to see if the model does indeed treat the documents within these categories as errors."
]
},
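{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the oversampling idea, the sketch below duplicates randomly chosen samples of the smaller categories until every category is as large as the biggest one. The function name `oversample` and the toy data are assumptions made for this example; libraries such as imbalanced-learn provide maintained implementations.\n",
"\n",
"```python\n",
"import random\n",
"from collections import Counter\n",
"\n",
"def oversample(texts, labels, seed=0):\n",
"    # Group the samples by label.\n",
"    by_label = {}\n",
"    for text, label in zip(texts, labels):\n",
"        by_label.setdefault(label, []).append(text)\n",
"    # Every class is grown to the size of the largest class.\n",
"    target = max(len(group) for group in by_label.values())\n",
"    rng = random.Random(seed)\n",
"    out_texts, out_labels = [], []\n",
"    for label, group in by_label.items():\n",
"        # Draw extra copies with replacement until the class reaches the target size.\n",
"        extra = [rng.choice(group) for _ in range(target - len(group))]\n",
"        for text in group + extra:\n",
"            out_texts.append(text)\n",
"            out_labels.append(label)\n",
"    return out_texts, out_labels\n",
"\n",
"# Toy data: placeholders stand in for complaint narratives.\n",
"texts = ['m1', 'm2', 'm3', 'm4', 'c1', 'c2', 'l1']\n",
"labels = ['Mortgage'] * 4 + ['Credit card'] * 2 + ['Consumer Loan']\n",
"new_texts, new_labels = oversample(texts, labels)\n",
"print(Counter(new_labels))\n",
"```\n",
"\n",
"Resampling of this kind should be applied to the training split only, so that the test set keeps the original class distribution."
]
},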
{
"cell_type": "markdown",
"metadata": {
"id": "o37kEaVWdhYx"
},
"source": [
"Categories with more data points generally had a better score, and the misclassifications in these categories seem to follow the trend of overlapping categories. (For example, 320 descriptions that were supposed to be \"Bank account or service\" were wrongly classified as \"Checking or savings account\".)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:17:53.123272Z",
"start_time": "2018-11-16T08:17:53.113269Z"
},
"id": "haDB530cdhYx"
},
"outputs": [],
"source": [
"def observe_errors(actual_response, wrongly_predicted_response):\n",
"    # Return test descriptions whose true class is `actual_response` but which\n",
"    # were predicted as `wrongly_predicted_response` (both are encoded class ids).\n",
"    warnings.filterwarnings(action = 'ignore', category = DeprecationWarning)\n",
"    # Line up each description with its actual and predicted class id.\n",
"    compare = pd.DataFrame(list(zip(x_test, y_test, pred)), columns = ['description', 'actual', 'predicted'])\n",
"    # Add readable product names, then keep only the requested error pair.\n",
"    compare = compare.assign(actual_product = encoder.inverse_transform(compare.actual),\n",
"                             predicted_product = encoder.inverse_transform(compare.predicted)) \\\n",
"                     .loc[(compare.actual == actual_response) & (compare.predicted == wrongly_predicted_response),\n",
"                          ['description', 'actual_product', 'predicted_product']]\n",
"    return compare"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-16T08:17:53.548054Z",
"start_time": "2018-11-16T08:17:53.181781Z"
},
"id": "MP2BHfgHdhYx",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 310
},
"outputId": "788ad62a-8814-4956-b4a3-3e955aeb367d"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" description \\\n",
"897 I opened a business merchant account about two months ago with bank of America in XXXX, WA. on XXXX XXXX. at zip code XXXX. I felt I was misrepresented about all the fees included in the merchant account service. I was informed that I would recei... \n",
"900 My grandfather XXXX XXXX opened a Certificate of Deposit at XXXX in my name XXXX XXXX on XXXX XXXX, XXXX. XXXX XXXX merged with Capitol One Bank on XXXX XXXX, XXXX. My grandfather recently passed on XXXX XXXX, XXXX and I found the certificate. On... \n",
"984 Premise : I ordered book of checks via US Bank web-portal service. \\nFew weeks went by, still not having received the checks, I called in to find out that they have sent it to some random address. The rep then proceeded to cancel the incorrectly ... \n",
"1007 On XX/XX/2016, at XXXX, I went to an outside ATM machine at BMO Harris Bank, XXXX, IL., to withdraw {$640.00} from my XXXX card. However, the ATM never dispensed the money, money never came out, though it gave me a receipt as if it had given me m... \n",
"1051 Chase has withdrawn money from my personal checking account ( non chase account ) for over four years. When I called Chase ( five times on XX/XX/2016 ), to inquire about the Chase account where the money is going, they were unable to provide the ... \n",
"\n",
" actual_product predicted_product \n",
"897 Bank account or service Checking or savings account \n",
"900 Bank account or service Checking or savings account \n",
"984 Bank account or service Checking or savings account \n",
"1007 Bank account or service Checking or savings account \n",
"1051 Bank account or service Checking or savings account "
],
"text/html": [
"\n",
" <div id=\"df-d1311162-5686-455a-8fc0-33af548d1b98\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>description</th>\n",
" <th>actual_product</th>\n",
" <th>predicted_product</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>897</th>\n",
" <td>I opened a business merchant account about two months ago with bank of America in XXXX, WA. on XXXX XXXX. at zip code XXXX. I felt I was misrepresented about all the fees included in the merchant account service. I was informed that I would recei...</td>\n",
" <td>Bank account or service</td>\n",
" <td>Checking or savings account</td>\n",
" </tr>\n",
" <tr>\n",
" <th>900</th>\n",
" <td>My grandfather XXXX XXXX opened a Certificate of Deposit at XXXX in my name XXXX XXXX on XXXX XXXX, XXXX. XXXX XXXX merged with Capitol One Bank on XXXX XXXX, XXXX. My grandfather recently passed on XXXX XXXX, XXXX and I found the certificate. On...</td>\n",
" <td>Bank account or service</td>\n",
" <td>Checking or savings account</td>\n",
" </tr>\n",
" <tr>\n",
" <th>984</th>\n",
" <td>Premise : I ordered book of checks via US Bank web-portal service. \\nFew weeks went by, still not having received the checks, I called in to find out that they have sent it to some random address. The rep then proceeded to cancel the incorrectly ...</td>\n",
" <td>Bank account or service</td>\n",
" <td>Checking or savings account</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1007</th>\n",
" <td>On XX/XX/2016, at XXXX, I went to an outside ATM machine at BMO Harris Bank, XXXX, IL., to withdraw {$640.00} from my XXXX card. However, the ATM never dispensed the money, money never came out, though it gave me a receipt as if it had given me m...</td>\n",
" <td>Bank account or service</td>\n",
" <td>Checking or savings account</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1051</th>\n",
" <td>Chase has withdrawn money from my personal checking account ( non chase account ) for over four years. When I called Chase ( five times on XX/XX/2016 ), to inquire about the Chase account where the money is going, they were unable to provide the ...</td>\n",
" <td>Bank account or service</td>\n",
" <td>Checking or savings account</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 31
}
],
"source": [
"observe_errors(actual_response = 0, wrongly_predicted_response = 1).tail(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "klaYGeoKdhYy"
},
"source": [
"We can see that the misclassified complaints shown above are quite ambiguous and contain ideas and keywords related to both categories, which is presumably why the model misclassified them."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "btJWAMxPdhYy"
},
"source": [
"Because of how text data is converted into numerical features, data pre-processing is an extremely important step in NLP problems, and the \"garbage-in, garbage-out\" property can be more pronounced than in other machine learning applications. Also note that this dataset comes with existing labels, so we could just as easily have used any other supervised learning technique (an SVM, for example) to classify the documents. If, however, no labels are available, we would have to turn to unsupervised learning algorithms such as K-means or LDA to try to cluster the documents into sensible categories."
]
},
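{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a feel for the unsupervised route, here is a minimal K-means sketch on word-count vectors, written in plain Python. The four documents and the choice of starting centroids are assumptions made for illustration; in practice one would more likely cluster the TF-IDF features with a library implementation such as scikit-learn's `KMeans`.\n",
"\n",
"```python\n",
"def bag_of_words(docs):\n",
"    # Build a shared vocabulary and one count vector per document.\n",
"    vocab = sorted({word for doc in docs for word in doc.split()})\n",
"    return [[doc.split().count(word) for word in vocab] for doc in docs]\n",
"\n",
"def squared_distance(a, b):\n",
"    return sum((x - y) ** 2 for x, y in zip(a, b))\n",
"\n",
"def kmeans(points, initial_centroids, n_iter=10):\n",
"    centroids = [list(c) for c in initial_centroids]\n",
"    assignment = []\n",
"    for _ in range(n_iter):\n",
"        # Assignment step: attach every point to its nearest centroid.\n",
"        assignment = [\n",
"            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))\n",
"            for p in points\n",
"        ]\n",
"        # Update step: move each centroid to the mean of its points.\n",
"        for j in range(len(centroids)):\n",
"            members = [p for p, a in zip(points, assignment) if a == j]\n",
"            if members:\n",
"                centroids[j] = [sum(col) / len(members) for col in zip(*members)]\n",
"    return assignment\n",
"\n",
"# Made-up, unlabeled documents.\n",
"docs = [\n",
"    'mortgage payment escrow',\n",
"    'escrow mortgage late payment',\n",
"    'credit card fee',\n",
"    'card fee credit interest',\n",
"]\n",
"vectors = bag_of_words(docs)\n",
"# Deterministic start for the sketch: seed the two centroids with documents 0 and 2.\n",
"clusters = kmeans(vectors, [vectors[0], vectors[2]])\n",
"print(clusters)\n",
"```\n",
"\n",
"The two mortgage documents land in one cluster and the two credit card documents in the other, without any labels being used."
]
},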
{
"cell_type": "markdown",
"metadata": {
"id": "_qaejjybdhYy"
},
"source": [
"We have essentially just taught the computer how to \"understand\" a small selected group of documents in a very human way with approximately 50 lines of code (ignoring the functions for visualisation)!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
},
"colab": {
"name": "Financial Complaints Classification.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}