{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Tutorial: Machine Learning with Text in scikit-learn by Kevin Markham" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Agenda\n", | |
"\n", | |
"1. Model building in scikit-learn (refresher)\n", | |
"2. Representing text as numerical data\n", | |
"3. Reading a text-based dataset into pandas\n", | |
"4. Vectorizing our dataset\n", | |
"5. Building and evaluating a model\n", | |
"6. Comparing models\n", | |
"7. Examining a model for further insight\n", | |
"8. Practicing this workflow on another dataset\n", | |
"9. Tuning the vectorizer (discussion)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# for Python 2: use print only as a function\n", | |
"#from __future__ import print_function" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 1: Model building in scikit-learn (refresher)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# load the iris dataset as an example\n", | |
"from sklearn.datasets import load_iris\n", | |
"iris = load_iris()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# store the feature matrix (X) and response vector (y) - Supervised Learning\n", | |
"X = iris.data\n", | |
"y = iris.target" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output. \n", | |
"\n", | |
"Why are we denoting **X** is in uppercase and **y** is in lowercase? \n", | |
"Answer: \n", | |
"X is two-dimensional data and y is a single dimensional data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(150, 4)\n", | |
"(150,)\n" | |
] | |
} | |
], | |
"source": [ | |
"# check the shapes of X and y\n", | |
"print(X.shape) # matrix\n", | |
"print(y.shape) # vector" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**\"Observations\"** are also known as samples, instances, or records." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>sepal length (cm)</th>\n", | |
" <th>sepal width (cm)</th>\n", | |
" <th>petal length (cm)</th>\n", | |
" <th>petal width (cm)</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>5.1</td>\n", | |
" <td>3.5</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>4.9</td>\n", | |
" <td>3.0</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>4.7</td>\n", | |
" <td>3.2</td>\n", | |
" <td>1.3</td>\n", | |
" <td>0.2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>4.6</td>\n", | |
" <td>3.1</td>\n", | |
" <td>1.5</td>\n", | |
" <td>0.2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>5.0</td>\n", | |
" <td>3.6</td>\n", | |
" <td>1.4</td>\n", | |
" <td>0.2</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", | |
"0 5.1 3.5 1.4 0.2\n", | |
"1 4.9 3.0 1.4 0.2\n", | |
"2 4.7 3.2 1.3 0.2\n", | |
"3 4.6 3.1 1.5 0.2\n", | |
"4 5.0 3.6 1.4 0.2" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the first 5 rows of the feature matrix (including the feature names)\n", | |
"import pandas as pd\n", | |
"pd.DataFrame(X, columns=iris.feature_names).head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", | |
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", | |
" 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", | |
" 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", | |
" 2 2]\n" | |
] | |
} | |
], | |
"source": [ | |
"# examine the response vector\n", | |
"print(y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", | |
" metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", | |
" weights='uniform')" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# import the class\n", | |
"from sklearn.neighbors import KNeighborsClassifier\n", | |
"\n", | |
"# instantiate the model (with the default parameters)\n", | |
"knn = KNeighborsClassifier()\n", | |
"\n", | |
"# fit the model with data (occurs in-place)\n", | |
"knn.fit(X, y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([0])" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# predict the response for a new observation\n", | |
"knn.predict([[3, 5, 4, 1]])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 2: Representing text as numerical data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# example text for model training (SMS messages)\n", | |
"simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", | |
"\n", | |
"> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", | |
"\n", | |
"We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# import and instantiate CountVectorizer (with the default parameters)\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"vect = CountVectorizer()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", | |
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", | |
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", | |
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", | |
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", | |
" tokenizer=None, vocabulary=None)" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# learn the 'vocabulary' of the training data (occurs in-place)\n", | |
"vect.fit(simple_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['cab', 'call', 'me', 'please', 'tonight', 'you']" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the fitted vocabulary\n", | |
"vect.get_feature_names()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<3x6 sparse matrix of type '<class 'numpy.int64'>'\n", | |
"\twith 9 stored elements in Compressed Sparse Row format>" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# transform training data into a 'document-term matrix'\n", | |
"simple_train_dtm = vect.transform(simple_train)\n", | |
"simple_train_dtm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[0, 1, 0, 0, 1, 1],\n", | |
" [1, 1, 1, 0, 0, 0],\n", | |
" [0, 1, 1, 2, 0, 0]])" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# convert sparse matrix to a dense matrix\n", | |
"simple_train_dtm.toarray()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>cab</th>\n", | |
" <th>call</th>\n", | |
" <th>me</th>\n", | |
" <th>please</th>\n", | |
" <th>tonight</th>\n", | |
" <th>you</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" cab call me please tonight you\n", | |
"0 0 1 0 0 1 1\n", | |
"1 1 1 1 0 0 0\n", | |
"2 0 1 1 2 0 0" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the vocabulary and document-term matrix together\n", | |
"pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", | |
"\n", | |
"> In this scheme, features and samples are defined as follows:\n", | |
"\n", | |
"> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", | |
"> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", | |
"\n", | |
"> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", | |
"\n", | |
"> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." | |
] | |
}, | |
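{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the last point above, the following cell (a small sketch reusing the `vect` fitted earlier) transforms two documents that contain the same tokens in different orders; because the bag-of-words representation ignores position, both rows of the resulting matrix are identical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# word order does not affect the bag-of-words representation:\n",
"# both documents below map to the same count vector\n",
"reordered = ['call me please', 'please call me']\n",
"vect.transform(reordered).toarray()"
]
},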
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"scipy.sparse.csr.csr_matrix" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# check the type of the document-term matrix\n", | |
"type(simple_train_dtm)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" (0, 1)\t1\n", | |
" (0, 4)\t1\n", | |
" (0, 5)\t1\n", | |
" (1, 0)\t1\n", | |
" (1, 1)\t1\n", | |
" (1, 2)\t1\n", | |
" (2, 1)\t1\n", | |
" (2, 2)\t1\n", | |
" (2, 3)\t2\n" | |
] | |
} | |
], | |
"source": [ | |
"# examine the sparse matrix contents\n", | |
"print(simple_train_dtm)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", | |
"\n", | |
"> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", | |
"\n", | |
"> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", | |
"\n", | |
"> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package." | |
] | |
}, | |
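{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this concrete, the sketch below computes the fraction of stored (non-zero) entries in the small example document-term matrix; for a real corpus, this fraction would typically be far below 1%."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# fraction of non-zero entries in the example document-term matrix\n",
"# (9 stored elements out of 3 x 6 = 18 cells)\n",
"simple_train_dtm.nnz / (simple_train_dtm.shape[0] * simple_train_dtm.shape[1])"
]
},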
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# example text for model testing\n", | |
"simple_test = [\"please don't call me\"]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[0, 1, 1, 1, 0, 0]])" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# transform testing data into a document-term matrix (using existing vocabulary)\n", | |
"simple_test_dtm = vect.transform(simple_test)\n", | |
"simple_test_dtm.toarray()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>cab</th>\n", | |
" <th>call</th>\n", | |
" <th>me</th>\n", | |
" <th>please</th>\n", | |
" <th>tonight</th>\n", | |
" <th>you</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" cab call me please tonight you\n", | |
"0 0 1 1 1 0 0" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the vocabulary and document-term matrix together\n", | |
"pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Summary:**\n", | |
"\n", | |
"- `vect.fit(train)` **learns the vocabulary** of the training data\n", | |
"- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", | |
"- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" | |
] | |
}, | |
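{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the last bullet, the small sketch below transforms a document containing a token (`tomorrow`) that the vectorizer has never seen; the unseen token simply contributes nothing to the resulting vector."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 'tomorrow' is not in the fitted vocabulary, so it is ignored;\n",
"# only 'call' and 'me' are counted\n",
"vect.transform(['call me tomorrow']).toarray()"
]
},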
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 3: Reading a text-based dataset into pandas" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# read file into pandas using a relative path\n", | |
"path = 'data/sms.tsv'\n", | |
"sms = pd.read_table(path, header=None, names=['label', 'message'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# alternative: read file into pandas from a URL\n", | |
"# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'\n", | |
"# sms = pd.read_table(url, header=None, names=['label', 'message'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(5572, 2)" | |
] | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the shape\n", | |
"sms.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>label</th>\n", | |
" <th>message</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>ham</td>\n", | |
" <td>Go until jurong point, crazy.. Available only ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>ham</td>\n", | |
" <td>Ok lar... Joking wif u oni...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>spam</td>\n", | |
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>ham</td>\n", | |
" <td>U dun say so early hor... U c already then say...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>ham</td>\n", | |
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>spam</td>\n", | |
" <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>ham</td>\n", | |
" <td>Even my brother is not like to speak with me. ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>ham</td>\n", | |
" <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>spam</td>\n", | |
" <td>WINNER!! As a valued network customer you have...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>spam</td>\n", | |
" <td>Had your mobile 11 months or more? U R entitle...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" label message\n", | |
"0 ham Go until jurong point, crazy.. Available only ...\n", | |
"1 ham Ok lar... Joking wif u oni...\n", | |
"2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", | |
"3 ham U dun say so early hor... U c already then say...\n", | |
"4 ham Nah I don't think he goes to usf, he lives aro...\n", | |
"5 spam FreeMsg Hey there darling it's been 3 week's n...\n", | |
"6 ham Even my brother is not like to speak with me. ...\n", | |
"7 ham As per your request 'Melle Melle (Oru Minnamin...\n", | |
"8 spam WINNER!! As a valued network customer you have...\n", | |
"9 spam Had your mobile 11 months or more? U R entitle..." | |
] | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the first 10 rows\n", | |
"sms.head(10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"ham 4825\n", | |
"spam 747\n", | |
"Name: label, dtype: int64" | |
] | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the class distribution\n", | |
"sms.label.value_counts()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# convert label to a numerical variable\n", | |
"sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>label</th>\n", | |
" <th>message</th>\n", | |
" <th>label_num</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>ham</td>\n", | |
" <td>Go until jurong point, crazy.. Available only ...</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>ham</td>\n", | |
" <td>Ok lar... Joking wif u oni...</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>spam</td>\n", | |
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>ham</td>\n", | |
" <td>U dun say so early hor... U c already then say...</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>ham</td>\n", | |
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>spam</td>\n", | |
" <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>ham</td>\n", | |
" <td>Even my brother is not like to speak with me. ...</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>ham</td>\n", | |
" <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>spam</td>\n", | |
" <td>WINNER!! As a valued network customer you have...</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>spam</td>\n", | |
" <td>Had your mobile 11 months or more? U R entitle...</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" label message label_num\n", | |
"0 ham Go until jurong point, crazy.. Available only ... 0\n", | |
"1 ham Ok lar... Joking wif u oni... 0\n", | |
"2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n", | |
"3 ham U dun say so early hor... U c already then say... 0\n", | |
"4 ham Nah I don't think he goes to usf, he lives aro... 0\n", | |
"5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n", | |
"6 ham Even my brother is not like to speak with me. ... 0\n", | |
"7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n", | |
"8 spam WINNER!! As a valued network customer you have... 1\n", | |
"9 spam Had your mobile 11 months or more? U R entitle... 1" | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# check that the conversion worked\n", | |
"sms.head(10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(150, 4)\n", | |
"(150,)\n" | |
] | |
} | |
], | |
"source": [ | |
"# how to define X and y (from the iris data) for use with a MODEL\n", | |
"X = iris.data\n", | |
"y = iris.target\n", | |
"print(X.shape)\n", | |
"print(y.shape)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(5572,)\n", | |
"(5572,)\n" | |
] | |
} | |
], | |
"source": [ | |
"# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", | |
"X = sms.message\n", | |
"y = sms.label_num\n", | |
"print(X.shape)\n", | |
"print(y.shape)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(4179,)\n", | |
"(1393,)\n", | |
"(4179,)\n", | |
"(1393,)\n" | |
] | |
} | |
], | |
"source": [ | |
"# split X and y into training and testing sets\n", | |
"from sklearn.model_selection import train_test_split\n", | |
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", | |
"print(X_train.shape)\n", | |
"print(X_test.shape)\n", | |
"print(y_train.shape)\n", | |
"print(y_test.shape)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"On the other hand if you use **random_state=some_number**, then you can guarantee that the output of **Run 1** will be equal to the output of **Run 2**, i.e. your split will be always the same. It doesn't matter what the actual **random_state number** is 42, 0, 21, ... The important thing is that everytime you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples. In practice I would say, you should set the **random_state** to some fixed number while you test stuff, but then remove it in production if you really need a random (and not a fixed) split." | |
] | |
}, | |
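{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that reproducibility: two calls to `train_test_split` with the same `random_state` return identical splits (the variable names below are just for illustration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# two splits with the same random_state are identical\n",
"X_a, _, _, _ = train_test_split(X, y, random_state=42)\n",
"X_b, _, _, _ = train_test_split(X, y, random_state=42)\n",
"X_a.equals(X_b)"
]
},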
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 4: Vectorizing our dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# instantiate the vectorizer\n", | |
"vect = CountVectorizer()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# learn training data vocabulary, then use it to create a document-term matrix\n", | |
"vect.fit(X_train)\n", | |
"X_train_dtm = vect.transform(X_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 87, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# equivalently: combine fit and transform into a single step\n", | |
"X_train_dtm = vect.fit_transform(X_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 93, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['〨ud evening',\n", | |
" '〨ud',\n", | |
" 'èn',\n", | |
" 'zyada kisi',\n", | |
" 'zyada',\n", | |
" 'zouk with',\n", | |
" 'zouk',\n", | |
" 'zoom to',\n", | |
" 'zoom']" | |
] | |
}, | |
"execution_count": 93, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the fitted vocabulary\n", | |
"vect.get_feature_names()[:-10:-1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"7456" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# total feature count\n", | |
"len(vect.get_feature_names())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<4179x7456 sparse matrix of type '<class 'numpy.int64'>'\n", | |
"\twith 55209 stored elements in Compressed Sparse Row format>" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the document-term matrix\n", | |
"X_train_dtm" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n", | |
"\twith 17604 stored elements in Compressed Sparse Row format>" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# transform testing data (using fitted vocabulary) into a document-term matrix\n", | |
"X_test_dtm = vect.transform(X_test)\n", | |
"X_test_dtm" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 5: Building and evaluating a model\n", | |
"\n", | |
"We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", | |
"\n", | |
"> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as **tf-idf** may also work.\n", | |
"\n", | |
"Meaning of *discrete* - :\tseparate, distinct, individual, detached, unattached, disconnected, discontinuous, disjunct, disjoined" | |
] | |
}, | |
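{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside on the tf-idf remark in the quote above, here is a minimal sketch: scikit-learn's `TfidfVectorizer` produces fractional (term frequency-inverse document frequency) features rather than raw counts, and `MultinomialNB` can often consume those as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# aside: tf-idf yields fractional feature values instead of integer counts\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"tfidf = TfidfVectorizer()\n",
"tfidf.fit_transform(simple_train).toarray()"
]
},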
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# import and instantiate a Multinomial Naive Bayes model\n", | |
"from sklearn.naive_bayes import MultinomialNB\n", | |
"nb = MultinomialNB()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n", | |
"Wall time: 6.78 ms\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" | |
] | |
}, | |
"execution_count": 50, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", | |
"%time nb.fit(X_train_dtm, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# make class predictions for X_test_dtm\n", | |
"y_pred_class = nb.predict(X_test_dtm)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.98851399856424982" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate accuracy of class predictions\n", | |
"from sklearn import metrics\n", | |
"metrics.accuracy_score(y_test, y_pred_class)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[1203, 5],\n", | |
" [ 11, 174]])" | |
] | |
}, | |
"execution_count": 43, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# print the confusion matrix\n", | |
"metrics.confusion_matrix(y_test, y_pred_class)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"574 Waiting for your call.\n", | |
"3375 Also andros ice etc etc\n", | |
"45 No calls..messages..missed calls\n", | |
"3415 No pic. Please re-send.\n", | |
"1988 No calls..messages..missed calls\n", | |
"Name: message, dtype: object" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# print message text for the false positives (ham incorrectly classified as spam)\n", | |
"X_test[(y_pred_class==1) & (y_test==0)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"574 Waiting for your call.\n", | |
"3375 Also andros ice etc etc\n", | |
"45 No calls..messages..missed calls\n", | |
"3415 No pic. Please re-send.\n", | |
"1988 No calls..messages..missed calls\n", | |
"Name: message, dtype: object" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# print message text for the false positives (ham incorrectly classified as spam)\n", | |
"X_test[(y_pred_class > y_test)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"3132 LookAtMe!: Thanks for your purchase of a video...\n", | |
"5 FreeMsg Hey there darling it's been 3 week's n...\n", | |
"3530 Xmas & New Years Eve tickets are now on sale f...\n", | |
"684 Hi I'm sue. I am 20 years old and work as a la...\n", | |
"1875 Would you like to see my XXX pics they are so ...\n", | |
"1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n", | |
"4298 thesmszone.com lets you send free anonymous an...\n", | |
"4949 Hi this is Amy, we will be sending you a free ...\n", | |
"2821 INTERFLORA - It's not too late to order Inter...\n", | |
"2247 Hi ya babe x u 4goten bout me?' scammers getti...\n", | |
"4514 Money i have won wining number 946 wot do i do...\n", | |
"Name: message, dtype: object" | |
] | |
}, | |
"execution_count": 46, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# print message text for the false negatives (spam incorrectly classified as ham)\n", | |
"X_test[(y_pred_class < y_test)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"\"Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!\"" | |
] | |
}, | |
"execution_count": 47, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# example false negative\n", | |
"X_test[2247]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n", | |
" 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", | |
"y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", | |
"y_pred_prob" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.98664310005369615" | |
] | |
}, | |
"execution_count": 49, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate AUC\n", | |
"metrics.roc_auc_score(y_test, y_pred_prob)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 6: Comparing models\n", | |
"\n", | |
"We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", | |
"\n", | |
"> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." | |
] | |
}, | |
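{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, for a document with feature vector $x$ (its row in the document-term matrix), logistic regression models the probability of the positive class (spam) as\n",
"\n",
"$$P(y=1 \\mid x) = \\frac{1}{1 + e^{-(w \\cdot x + b)}}$$\n",
"\n",
"where the weight vector $w$ and intercept $b$ are learned during fitting."
]
},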
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# import and instantiate a logistic regression model\n", | |
"from sklearn.linear_model import LogisticRegression\n", | |
"logreg = LogisticRegression()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 40 ms, sys: 12 ms, total: 52 ms\n", | |
"Wall time: 53.4 ms\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", | |
" intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", | |
" penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", | |
" verbose=0, warm_start=False)" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# train the model using X_train_dtm\n", | |
"%time logreg.fit(X_train_dtm, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# make class predictions for X_test_dtm\n", | |
"y_pred_class = logreg.predict(X_test_dtm)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 54, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n", | |
" 0.99725053, 0.00157706])" | |
] | |
}, | |
"execution_count": 54, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate predicted probabilities for X_test_dtm (well calibrated)\n", | |
"y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", | |
"y_pred_prob" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9877961234745154" | |
] | |
}, | |
"execution_count": 55, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate accuracy\n", | |
"metrics.accuracy_score(y_test, y_pred_class)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.99368176123143015" | |
] | |
}, | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate AUC\n", | |
"metrics.roc_auc_score(y_test, y_pred_prob)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 7: Examining a model for further insight\n", | |
"\n", | |
"We will examine the our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"7456" | |
] | |
}, | |
"execution_count": 57, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# store the vocabulary of X_train\n", | |
"X_train_tokens = vect.get_feature_names()\n", | |
"len(X_train_tokens)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']\n" | |
] | |
} | |
], | |
"source": [ | |
"# examine the first 50 tokens\n", | |
"print(X_train_tokens[0:50])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 59, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']\n" | |
] | |
} | |
], | |
"source": [ | |
"# examine the last 50 tokens\n", | |
"print(X_train_tokens[-50:])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 60, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[ 0., 0., 0., ..., 1., 1., 1.],\n", | |
" [ 5., 23., 2., ..., 0., 0., 0.]])" | |
] | |
}, | |
"execution_count": 60, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Naive Bayes counts the number of times each token appears in each class\n", | |
"nb.feature_count_" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 62, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(2, 7456)" | |
] | |
}, | |
"execution_count": 62, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# rows represent classes, columns represent tokens\n", | |
"nb.feature_count_.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 63, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 0., 0., 0., ..., 1., 1., 1.])" | |
] | |
}, | |
"execution_count": 63, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# number of times each token appears across all HAM messages\n", | |
"ham_token_count = nb.feature_count_[0, :]\n", | |
"ham_token_count" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 64, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 5., 23., 2., ..., 0., 0., 0.])" | |
] | |
}, | |
"execution_count": 64, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# number of times each token appears across all SPAM messages\n", | |
"spam_token_count = nb.feature_count_[1, :]\n", | |
"spam_token_count" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 68, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ham</th>\n", | |
" <th>spam</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>token</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>00</th>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>000</th>\n", | |
" <td>0</td>\n", | |
" <td>23</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>008704050406</th>\n", | |
" <td>0</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0121</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>01223585236</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ham spam\n", | |
"token \n", | |
"00 0 5\n", | |
"000 0 23\n", | |
"008704050406 0 2\n", | |
"0121 0 1\n", | |
"01223585236 0 1" | |
] | |
}, | |
"execution_count": 68, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# create a DataFrame of tokens with their separate ham and spam counts\n", | |
"tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')\n", | |
"tokens.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 69, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ham</th>\n", | |
" <th>spam</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>token</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>very</th>\n", | |
" <td>64</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>nasty</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>villa</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>beloved</th>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>textoperator</th>\n", | |
" <td>0</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ham spam\n", | |
"token \n", | |
"very 64 2\n", | |
"nasty 1 1\n", | |
"villa 0 1\n", | |
"beloved 1 0\n", | |
"textoperator 0 2" | |
] | |
}, | |
"execution_count": 69, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine 5 random DataFrame rows\n", | |
"tokens.sample(5, random_state=6)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 70, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 3617., 562.])" | |
] | |
}, | |
"execution_count": 70, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Naive Bayes counts the number of observations in each class\n", | |
"nb.class_count_" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Before we can calculate the \"spamminess\" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 71, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ham</th>\n", | |
" <th>spam</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>token</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>very</th>\n", | |
" <td>65</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>nasty</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>villa</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>beloved</th>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>textoperator</th>\n", | |
" <td>1</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ham spam\n", | |
"token \n", | |
"very 65 3\n", | |
"nasty 2 2\n", | |
"villa 1 2\n", | |
"beloved 2 1\n", | |
"textoperator 1 3" | |
] | |
}, | |
"execution_count": 71, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# add 1 to ham and spam counts to avoid dividing by 0\n", | |
"tokens['ham'] = tokens.ham + 1\n", | |
"tokens['spam'] = tokens.spam + 1\n", | |
"tokens.sample(5, random_state=6)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 72, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ham</th>\n", | |
" <th>spam</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>token</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>very</th>\n", | |
" <td>0.017971</td>\n", | |
" <td>0.005338</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>nasty</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.003559</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>villa</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.003559</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>beloved</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.001779</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>textoperator</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.005338</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ham spam\n", | |
"token \n", | |
"very 0.017971 0.005338\n", | |
"nasty 0.000553 0.003559\n", | |
"villa 0.000276 0.003559\n", | |
"beloved 0.000553 0.001779\n", | |
"textoperator 0.000276 0.005338" | |
] | |
}, | |
"execution_count": 72, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# convert the ham and spam counts into frequencies\n", | |
"tokens['ham'] = tokens.ham / nb.class_count_[0]\n", | |
"tokens['spam'] = tokens.spam / nb.class_count_[1]\n", | |
"tokens.sample(5, random_state=6)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 73, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ham</th>\n", | |
" <th>spam</th>\n", | |
" <th>spam_ratio</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>token</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>very</th>\n", | |
" <td>0.017971</td>\n", | |
" <td>0.005338</td>\n", | |
" <td>0.297044</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>nasty</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.003559</td>\n", | |
" <td>6.435943</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>villa</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.003559</td>\n", | |
" <td>12.871886</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>beloved</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>3.217972</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>textoperator</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.005338</td>\n", | |
" <td>19.307829</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ham spam spam_ratio\n", | |
"token \n", | |
"very 0.017971 0.005338 0.297044\n", | |
"nasty 0.000553 0.003559 6.435943\n", | |
"villa 0.000276 0.003559 12.871886\n", | |
"beloved 0.000553 0.001779 3.217972\n", | |
"textoperator 0.000276 0.005338 19.307829" | |
] | |
}, | |
"execution_count": 73, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# calculate the ratio of spam-to-ham for each token\n", | |
"tokens['spam_ratio'] = tokens.spam / tokens.ham\n", | |
"tokens.sample(5, random_state=6)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 74, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ham</th>\n", | |
" <th>spam</th>\n", | |
" <th>spam_ratio</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>token</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>claim</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.158363</td>\n", | |
" <td>572.798932</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>prize</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.135231</td>\n", | |
" <td>489.131673</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>150p</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.087189</td>\n", | |
" <td>315.361210</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>tone</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.085409</td>\n", | |
" <td>308.925267</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>guaranteed</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.076512</td>\n", | |
" <td>276.745552</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>18</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.069395</td>\n", | |
" <td>251.001779</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cs</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.065836</td>\n", | |
" <td>238.129893</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>www</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.129893</td>\n", | |
" <td>234.911922</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1000</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.056940</td>\n", | |
" <td>205.950178</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>awarded</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.053381</td>\n", | |
" <td>193.078292</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>150ppm</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.051601</td>\n", | |
" <td>186.642349</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>uk</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.099644</td>\n", | |
" <td>180.206406</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>500</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.048043</td>\n", | |
" <td>173.770463</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>ringtone</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.044484</td>\n", | |
" <td>160.898577</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>000</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.042705</td>\n", | |
" <td>154.462633</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mob</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.042705</td>\n", | |
" <td>154.462633</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>co</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.078292</td>\n", | |
" <td>141.590747</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>collection</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.039146</td>\n", | |
" <td>141.590747</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>valid</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.037367</td>\n", | |
" <td>135.154804</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2000</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.037367</td>\n", | |
" <td>135.154804</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>800</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.037367</td>\n", | |
" <td>135.154804</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10p</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.037367</td>\n", | |
" <td>135.154804</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8007</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.035587</td>\n", | |
" <td>128.718861</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>0.000553</td>\n", | |
" <td>0.067616</td>\n", | |
" <td>122.282918</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>weekly</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.033808</td>\n", | |
" <td>122.282918</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>tones</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.032028</td>\n", | |
" <td>115.846975</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>land</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.032028</td>\n", | |
" <td>115.846975</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>http</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.032028</td>\n", | |
" <td>115.846975</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>national</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.030249</td>\n", | |
" <td>109.411032</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5000</th>\n", | |
" <td>0.000276</td>\n", | |
" <td>0.030249</td>\n", | |
" <td>109.411032</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>went</th>\n", | |
" <td>0.012718</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.139912</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>ll</th>\n", | |
" <td>0.052530</td>\n", | |
" <td>0.007117</td>\n", | |
" <td>0.135494</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>told</th>\n", | |
" <td>0.013824</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.128719</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>feel</th>\n", | |
" <td>0.013824</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.128719</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>gud</th>\n", | |
" <td>0.014100</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.126195</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>cos</th>\n", | |
" <td>0.014929</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.119184</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>but</th>\n", | |
" <td>0.090683</td>\n", | |
" <td>0.010676</td>\n", | |
" <td>0.117731</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>amp</th>\n", | |
" <td>0.015206</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.117017</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>something</th>\n", | |
" <td>0.015206</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.117017</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>sure</th>\n", | |
" <td>0.015206</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.117017</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>ok</th>\n", | |
" <td>0.061100</td>\n", | |
" <td>0.007117</td>\n", | |
" <td>0.116488</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>said</th>\n", | |
" <td>0.016312</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.109084</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>morning</th>\n", | |
" <td>0.016865</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.105507</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>yeah</th>\n", | |
" <td>0.017694</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.100562</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>lol</th>\n", | |
" <td>0.017694</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.100562</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>anything</th>\n", | |
" <td>0.017971</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.099015</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>my</th>\n", | |
" <td>0.150401</td>\n", | |
" <td>0.014235</td>\n", | |
" <td>0.094646</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>doing</th>\n", | |
" <td>0.019077</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.093275</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>way</th>\n", | |
" <td>0.019630</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.090647</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>ask</th>\n", | |
" <td>0.019630</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.090647</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>already</th>\n", | |
" <td>0.019630</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.090647</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>too</th>\n", | |
" <td>0.021841</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.081468</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>come</th>\n", | |
" <td>0.048936</td>\n", | |
" <td>0.003559</td>\n", | |
" <td>0.072723</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>later</th>\n", | |
" <td>0.030688</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.057981</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>lor</th>\n", | |
" <td>0.032900</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.054084</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>da</th>\n", | |
" <td>0.032900</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.054084</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>she</th>\n", | |
" <td>0.035665</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.049891</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>he</th>\n", | |
" <td>0.047000</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.037858</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>lt</th>\n", | |
" <td>0.064142</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.027741</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>gt</th>\n", | |
" <td>0.064971</td>\n", | |
" <td>0.001779</td>\n", | |
" <td>0.027387</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>7456 rows × 3 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ham spam spam_ratio\n", | |
"token \n", | |
"claim 0.000276 0.158363 572.798932\n", | |
"prize 0.000276 0.135231 489.131673\n", | |
"150p 0.000276 0.087189 315.361210\n", | |
"tone 0.000276 0.085409 308.925267\n", | |
"guaranteed 0.000276 0.076512 276.745552\n", | |
"18 0.000276 0.069395 251.001779\n", | |
"cs 0.000276 0.065836 238.129893\n", | |
"www 0.000553 0.129893 234.911922\n", | |
"1000 0.000276 0.056940 205.950178\n", | |
"awarded 0.000276 0.053381 193.078292\n", | |
"150ppm 0.000276 0.051601 186.642349\n", | |
"uk 0.000553 0.099644 180.206406\n", | |
"500 0.000276 0.048043 173.770463\n", | |
"ringtone 0.000276 0.044484 160.898577\n", | |
"000 0.000276 0.042705 154.462633\n", | |
"mob 0.000276 0.042705 154.462633\n", | |
"co 0.000553 0.078292 141.590747\n", | |
"collection 0.000276 0.039146 141.590747\n", | |
"valid 0.000276 0.037367 135.154804\n", | |
"2000 0.000276 0.037367 135.154804\n", | |
"800 0.000276 0.037367 135.154804\n", | |
"10p 0.000276 0.037367 135.154804\n", | |
"8007 0.000276 0.035587 128.718861\n", | |
"16 0.000553 0.067616 122.282918\n", | |
"weekly 0.000276 0.033808 122.282918\n", | |
"tones 0.000276 0.032028 115.846975\n", | |
"land 0.000276 0.032028 115.846975\n", | |
"http 0.000276 0.032028 115.846975\n", | |
"national 0.000276 0.030249 109.411032\n", | |
"5000 0.000276 0.030249 109.411032\n", | |
"... ... ... ...\n", | |
"went 0.012718 0.001779 0.139912\n", | |
"ll 0.052530 0.007117 0.135494\n", | |
"told 0.013824 0.001779 0.128719\n", | |
"feel 0.013824 0.001779 0.128719\n", | |
"gud 0.014100 0.001779 0.126195\n", | |
"cos 0.014929 0.001779 0.119184\n", | |
"but 0.090683 0.010676 0.117731\n", | |
"amp 0.015206 0.001779 0.117017\n", | |
"something 0.015206 0.001779 0.117017\n", | |
"sure 0.015206 0.001779 0.117017\n", | |
"ok 0.061100 0.007117 0.116488\n", | |
"said 0.016312 0.001779 0.109084\n", | |
"morning 0.016865 0.001779 0.105507\n", | |
"yeah 0.017694 0.001779 0.100562\n", | |
"lol 0.017694 0.001779 0.100562\n", | |
"anything 0.017971 0.001779 0.099015\n", | |
"my 0.150401 0.014235 0.094646\n", | |
"doing 0.019077 0.001779 0.093275\n", | |
"way 0.019630 0.001779 0.090647\n", | |
"ask 0.019630 0.001779 0.090647\n", | |
"already 0.019630 0.001779 0.090647\n", | |
"too 0.021841 0.001779 0.081468\n", | |
"come 0.048936 0.003559 0.072723\n", | |
"later 0.030688 0.001779 0.057981\n", | |
"lor 0.032900 0.001779 0.054084\n", | |
"da 0.032900 0.001779 0.054084\n", | |
"she 0.035665 0.001779 0.049891\n", | |
"he 0.047000 0.001779 0.037858\n", | |
"lt 0.064142 0.001779 0.027741\n", | |
"gt 0.064971 0.001779 0.027387\n", | |
"\n", | |
"[7456 rows x 3 columns]" | |
] | |
}, | |
"execution_count": 74, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# examine the DataFrame sorted by spam_ratio\n", | |
"# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n", | |
"tokens.sort_values('spam_ratio', ascending=False)" | |
] | |
}, | |
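{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"*Added note (a hedged reading of the table above, not part of the original flow):* the **spam_ratio** column is the `spam` column divided by the `ham` column. For example, for `claim`: 0.158363 / 0.000276 ≈ 573, matching the displayed 572.80 up to display rounding. The quick check below recomputes the ratio directly from the two columns." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# sanity check: spam_ratio should match spam / ham for every token\n", | |
"(tokens['spam'] / tokens['ham']).sort_values(ascending=False).head()" | |
] | |
}, | |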
{ | |
"cell_type": "code", | |
"execution_count": 76, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"83.667259786476862" | |
] | |
}, | |
"execution_count": 76, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# look up the spam_ratio for a given token\n", | |
"tokens.loc['dating', 'spam_ratio']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 8: Practicing this workflow on another dataset\n", | |
"\n", | |
"Please open the **`exercise.ipynb`** notebook (or the **`exercise.py`** script)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 9: Tuning the vectorizer (discussion)\n", | |
"\n", | |
"Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 77, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", | |
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", | |
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", | |
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", | |
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", | |
" tokenizer=None, vocabulary=None)" | |
] | |
}, | |
"execution_count": 77, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# show default parameters for CountVectorizer\n", | |
"vect" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n", | |
"\n", | |
"- **stop_words:** string {'english'}, list, or None (default)\n", | |
" - If 'english', a built-in stop word list for English is used.\n", | |
" - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n", | |
" - If None, no stop words will be used." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 78, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# remove English stop words\n", | |
"vect = CountVectorizer(stop_words='english')" | |
] | |
}, | |
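{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A quick illustration (a hedged sketch on a made-up two-sentence corpus; `demo_corpus` and `demo_vect` are throwaway names, not part of the tutorial): fitting with `stop_words='english'` removes common words from the learned vocabulary. Note that newer scikit-learn versions spell the method `get_feature_names_out()`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# hedged demo: 'english' stop words vanish from the vocabulary\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"demo_corpus = ['the taxi arrived at the hotel', 'book me a taxi for the morning']\n", | |
"demo_vect = CountVectorizer(stop_words='english')\n", | |
"demo_vect.fit(demo_corpus)\n", | |
"# stop words such as 'the', 'at', 'me', and 'for' are filtered out\n", | |
"demo_vect.get_feature_names()" | |
] | |
}, | |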
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n", | |
" - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n", | |
" - All values of n such that min_n <= n <= max_n will be used." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 79, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# include 1-grams and 2-grams\n", | |
"vect = CountVectorizer(ngram_range=(1, 2))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n", | |
" - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n", | |
" - If float, the parameter represents a proportion of documents.\n", | |
" - If integer, the parameter represents an absolute count." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# ignore terms that appear in more than 50% of the documents\n", | |
"vect = CountVectorizer(max_df=0.5)" | |
] | |
}, | |
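{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Illustration (a hedged sketch on a four-document toy corpus; the names are throwaways): with `max_df=0.5`, a term appearing in 3 of the 4 documents (75%) is dropped as a corpus-specific stop word, while terms at exactly 50% survive, since only document frequencies *strictly* above the threshold are ignored." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# hedged demo: max_df=0.5 drops terms with document frequency above 50%\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"demo_corpus = ['cab ride home', 'cab fare home', 'train fare home', 'train times']\n", | |
"demo_vect = CountVectorizer(max_df=0.5)\n", | |
"demo_vect.fit(demo_corpus)\n", | |
"# 'home' appears in 3 of 4 documents (75% > 50%), so it is ignored\n", | |
"demo_vect.get_feature_names()" | |
] | |
}, | |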
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- **min_df:** float in range [0.0, 1.0] or int, default=1\n", | |
" - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called \"cut-off\" in the literature.)\n", | |
" - If float, the parameter represents a proportion of documents.\n", | |
" - If integer, the parameter represents an absolute count." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# only keep terms that appear in at least 2 documents\n", | |
"vect = CountVectorizer(min_df=2)" | |
] | |
}, | |
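{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Illustration (a hedged sketch, reusing the same toy corpus): with `min_df=2`, terms seen in only one document are cut off from the vocabulary." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# hedged demo: min_df=2 keeps only terms that occur in at least 2 documents\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"demo_corpus = ['cab ride home', 'cab fare home', 'train fare home', 'train times']\n", | |
"demo_vect = CountVectorizer(min_df=2)\n", | |
"demo_vect.fit(demo_corpus)\n", | |
"# 'ride' and 'times' each appear in a single document, so they are dropped\n", | |
"demo_vect.get_feature_names()" | |
] | |
}, | |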
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**Guidelines for tuning CountVectorizer:**\n", | |
"\n", | |
"- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n", | |
"- **Experiment**, and let the data tell you the best approach!" | |
] | |
} | |
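{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Closing sketch (a hedged example added here, not part of the original tutorial): a tuned vectorizer slots into the same fit/transform/evaluate workflow used earlier. It assumes `X_train`, `X_test`, `y_train`, and `y_test` from the train/test split created earlier in this notebook, and the parameter values are purely illustrative, not recommendations." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# hedged end-to-end sketch: tuned vectorizer + Multinomial Naive Bayes\n", | |
"# assumes X_train, X_test, y_train, y_test from earlier in this notebook\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"from sklearn.naive_bayes import MultinomialNB\n", | |
"from sklearn import metrics\n", | |
"\n", | |
"vect = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)\n", | |
"X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary from training data only\n", | |
"X_test_dtm = vect.transform(X_test)  # apply that same vocabulary to the test data\n", | |
"\n", | |
"nb = MultinomialNB()\n", | |
"nb.fit(X_train_dtm, y_train)\n", | |
"y_pred_class = nb.predict(X_test_dtm)\n", | |
"print(metrics.accuracy_score(y_test, y_pred_class))" | |
] | |
} | |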
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |