{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Implementation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import division # ensure that all division is float division\n",
"from __future__ import print_function # print function works properly when used with paranthesis\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import os, sys, re\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"pd.set_option(\"display.max_colwidth\", 255)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read in SMS Data.**\n",
"\n",
The SMS Spam Collection">
">The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It is a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.\n",
"\n",
">A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of the spam messages in these claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.\n",
"\n",
">A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university. They were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.\n",
"\n",
"\n",
"- Primary Source: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/\n",
"- Secondary: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(5572, 2)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ham</td>\n",
" <td>Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ham</td>\n",
" <td>Ok lar... Joking wif u oni...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>spam</td>\n",
" <td>Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&amp;C's apply 08452810075over18's</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ham</td>\n",
" <td>U dun say so early hor... U c already then say...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ham</td>\n",
" <td>Nah I don't think he goes to usf, he lives around here though</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"0 ham \n",
"1 ham \n",
"2 spam \n",
"3 ham \n",
"4 ham \n",
"\n",
" message \n",
"0 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... \n",
"1 Ok lar... Joking wif u oni... \n",
"2 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's \n",
"3 U dun say so early hor... U c already then say... \n",
"4 Nah I don't think he goes to usf, he lives around here though "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../data/sms.tsv\", sep=\"\\t\", names=['label', 'message'])\n",
"print(df.shape)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stratified Train Test Split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stratified means the proprtions of spam/ham in the train/test sets reflect the original dataset. You can see the percentage is about the same here."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4457, 2) (1115, 2)\n"
]
},
{
"data": {
"text/plain": [
"(0.86582903298182634, 0.86636771300448434)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cross_validation import train_test_split\n",
"train, test = train_test_split(df, test_size=0.2, stratify=df.label)\n",
"print(train.shape, test.shape)\n",
"train.label.value_counts()['ham'] / len(train), test.label.value_counts()['ham'] / len(test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create sample data frame and sample rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract two sample messages that we will use for testing in the functions below."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"spam | WELL DONE! Your 4* Costa Del Sol Holiday or £5000 await collection. Call 09050090044 Now toClaim. SAE, TCs, POBox334, Stockport, SK38xh, Cost£1.50/pm, Max10mins\n",
"ham | What's up my own oga. Left my phone at home and just saw ur messages. Hope you are good. Have a great weekend.\n"
]
}
],
"source": [
"sample_df = train.sample(2)\n",
"\n",
"sample_row1 = sample_df.iloc[0] # first row of sample_df\n",
"sample_row2 = sample_df.iloc[1] # second row of sample_df\n",
"\n",
"sample_message1 = sample_row1.message\n",
"sample_message2 = sample_row2.message\n",
"\n",
"print(sample_row1.label, \"|\", sample_message1)\n",
"print(sample_row2.label, \"|\", sample_message2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenize Message"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use http://regex101.com to come up with regular expressions."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['costa', 'now', 'max10mins', 'call', 'await', 'well', 'sol', 'collection', 'or', 'pobox334', 'cost', 'done', 'sae', 'sk38xh', 'del', 'stockport', 'holiday', 'tcs', 'your', 'toclaim', 'pm']\n",
"['and', 'great', 'own', 'are', 'just', 'my', \"what's\", 'messages', 'up', 'weekend', 'ur', 'phone', 'good', 'at', 'have', 'saw', 'home', 'you', 'oga', 'hope', 'left']\n"
]
}
],
"source": [
"def tokenize(msg):\n",
" \"\"\"\n",
" input: \"Change again... It's e one next to escalator...\"\n",
" output: [\"change\", \"again\", \"it's\", \"one\", \"next\", \"to\", \"escalator\"]\n",
" \"\"\"\n",
" msg_lowered = msg.lower()\n",
" # at least two characters long, cannot start with number\n",
" all_tokens = re.findall(r\"\\b[a-z][a-z0-9']+\\b\", msg_lowered)\n",
" return list(set(all_tokens))\n",
"\n",
"tokens1 = tokenize(sample_message1)\n",
"tokens2 = tokenize(sample_message2)\n",
"\n",
"print(tokens1)\n",
"print(tokens2)"
]
},
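{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the tokenizer on a made-up message (an illustration added here, not a message from the dataset). Single-character tokens and tokens starting with a digit are dropped, and because of `set()` the order of the returned tokens is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# made-up example: \"a\" is too short and \"2\" starts with a digit, so both are dropped\n",
"tokenize(\"Win a prize!! Text 2 win, it's free b4 8pm\")"
]
},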
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vectorize Message"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Walk through the steps of vectorizing a message outside of a function."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sample Message 1: WELL DONE! Your 4* Costa Del Sol Holiday or £5000 await collection. Call 09050090044 Now toClaim. SAE, TCs, POBox334, Stockport, SK38xh, Cost£1.50/pm, Max10mins\n",
"Tokens 1: ['costa', 'now', 'max10mins', 'call', 'await', 'well', 'sol', 'collection', 'or', 'pobox334', 'cost', 'done', 'sae', 'sk38xh', 'del', 'stockport', 'holiday', 'tcs', 'your', 'toclaim', 'pm']\n",
"Series 1:\n",
"await 1\n",
"call 1\n",
"collection 1\n",
"cost 1\n",
"costa 1\n",
"del 1\n",
"done 1\n",
"holiday 1\n",
"max10mins 1\n",
"now 1\n",
"or 1\n",
"pm 1\n",
"pobox334 1\n",
"sae 1\n",
"sk38xh 1\n",
"sol 1\n",
"stockport 1\n",
"tcs 1\n",
"toclaim 1\n",
"well 1\n",
"your 1\n",
"dtype: int64\n",
"\n",
"Sample Message 2: What's up my own oga. Left my phone at home and just saw ur messages. Hope you are good. Have a great weekend.\n",
"Tokens 2: ['and', 'great', 'own', 'are', 'just', 'my', \"what's\", 'messages', 'up', 'weekend', 'ur', 'phone', 'good', 'at', 'have', 'saw', 'home', 'you', 'oga', 'hope', 'left']\n",
"Series 2:\n",
"and 1\n",
"are 1\n",
"at 1\n",
"good 1\n",
"great 1\n",
"have 1\n",
"home 1\n",
"hope 1\n",
"just 1\n",
"left 1\n",
"messages 1\n",
"my 1\n",
"oga 1\n",
"own 1\n",
"phone 1\n",
"saw 1\n",
"up 1\n",
"ur 1\n",
"weekend 1\n",
"what's 1\n",
"you 1\n",
"dtype: int64\n",
"\n",
"Combine Series 1 and Series 2:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>and</th>\n",
" <th>are</th>\n",
" <th>at</th>\n",
" <th>await</th>\n",
" <th>call</th>\n",
" <th>collection</th>\n",
" <th>cost</th>\n",
" <th>costa</th>\n",
" <th>del</th>\n",
" <th>done</th>\n",
" <th>...</th>\n",
" <th>stockport</th>\n",
" <th>tcs</th>\n",
" <th>toclaim</th>\n",
" <th>up</th>\n",
" <th>ur</th>\n",
" <th>weekend</th>\n",
" <th>well</th>\n",
" <th>what's</th>\n",
" <th>you</th>\n",
" <th>your</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" and are at await call collection cost costa del done ... \\\n",
"0 0 0 0 1 1 1 1 1 1 1 ... \n",
"1 1 1 1 0 0 0 0 0 0 0 ... \n",
"\n",
" stockport tcs toclaim up ur weekend well what's you your \n",
"0 1 1 1 0 0 0 1 0 0 1 \n",
"1 0 0 0 1 1 1 0 1 1 0 \n",
"\n",
"[2 rows x 42 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_dict1 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}\n",
"for token in tokens1:\n",
" token_dict1[token] = 1 \n",
"series1 = pd.Series(token_dict1) # convert the dictionary into a series where the row labels are words\n",
"\n",
"# rewrite the same as above using a dict comprehension\n",
"series1 = pd.Series({token: 1 for token in tokens1})\n",
"\n",
"token_dict2 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}\n",
"for token in tokens2:\n",
" token_dict2[token] = 1 \n",
"series2 = pd.Series(token_dict2) # convert the dictionary into a series where the row labels are words\n",
"\n",
"# rewrite the same as above using a dict comprehension\n",
"series2 = pd.Series({token: 1 for token in tokens2})\n",
"\n",
"print(\"Sample Message 1:\", sample_message1)\n",
"print(\"Tokens 1:\", tokens1)\n",
"print(\"Series 1:\")\n",
"print(series1)\n",
"print()\n",
"print(\"Sample Message 2:\", sample_message2)\n",
"print(\"Tokens 2:\", tokens2)\n",
"print(\"Series 2:\")\n",
"print(series2)\n",
"print()\n",
"\n",
"print(\"Combine Series 1 and Series 2:\")\n",
"df2 = pd.DataFrame([series1, series2]) # comebine the two \n",
"df2.fillna(0, inplace=True)\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Repeat the same process as above of tokenzing and then vectorizing using a function."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def vectorize_row(row):\n",
" \"\"\"\n",
" input: row in data frame with a \".message\" attribute\n",
" output: vectorized row where the row labels are words and the values are 1 for each row\n",
" \"\"\"\n",
" message = row.message\n",
" tokens = tokenize(message)\n",
" vectorized_row = pd.Series({token: 1 for token in tokens})\n",
" return vectorized_row"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"await 1\n",
"call 1\n",
"collection 1\n",
"cost 1\n",
"costa 1\n",
"del 1\n",
"done 1\n",
"holiday 1\n",
"max10mins 1\n",
"now 1\n",
"or 1\n",
"pm 1\n",
"pobox334 1\n",
"sae 1\n",
"sk38xh 1\n",
"sol 1\n",
"stockport 1\n",
"tcs 1\n",
"toclaim 1\n",
"well 1\n",
"your 1\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorize_row(sample_row1)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"and 1\n",
"are 1\n",
"at 1\n",
"good 1\n",
"great 1\n",
"have 1\n",
"home 1\n",
"hope 1\n",
"just 1\n",
"left 1\n",
"messages 1\n",
"my 1\n",
"oga 1\n",
"own 1\n",
"phone 1\n",
"saw 1\n",
"up 1\n",
"ur 1\n",
"weekend 1\n",
"what's 1\n",
"you 1\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorize_row(sample_row2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Feature Matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is input to our Naive Bayes model."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_feature_matrix(df):\n",
" feature_matrix = df.apply(vectorize_row, axis=1)\n",
" feature_matrix.fillna(0, inplace=True)\n",
" return feature_matrix"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>and</th>\n",
" <th>are</th>\n",
" <th>at</th>\n",
" <th>await</th>\n",
" <th>call</th>\n",
" <th>collection</th>\n",
" <th>cost</th>\n",
" <th>costa</th>\n",
" <th>del</th>\n",
" <th>done</th>\n",
" <th>...</th>\n",
" <th>stockport</th>\n",
" <th>tcs</th>\n",
" <th>toclaim</th>\n",
" <th>up</th>\n",
" <th>ur</th>\n",
" <th>weekend</th>\n",
" <th>well</th>\n",
" <th>what's</th>\n",
" <th>you</th>\n",
" <th>your</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1942</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4809</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" and are at await call collection cost costa del done ... \\\n",
"1942 0 0 0 1 1 1 1 1 1 1 ... \n",
"4809 1 1 1 0 0 0 0 0 0 0 ... \n",
"\n",
" stockport tcs toclaim up ur weekend well what's you your \n",
"1942 1 1 1 0 0 0 1 0 0 1 \n",
"4809 0 0 0 1 1 1 0 1 1 0 \n",
"\n",
"[2 rows x 42 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_feature_matrix(sample_df)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(4457, 7213)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix = get_feature_matrix(train)\n",
"feature_matrix.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'a21', u'a30', u'aa', u'aah', u'aaniye', u'aaooooright', u'aathi',\n",
" u'ab', u'abbey', u'abdomen', u'abeg', u'abel', u'aberdeen', u'abi',\n",
" u'ability', u'abiola', u'abj', u'able', u'about', u'aboutas', u'above',\n",
" u'abroad', u'absence', u'absolutely', u'absolutly', u'abstract', u'abt',\n",
" u'abta', u'aburo', u'abuse', u'abusers', u'ac', u'academic', u'acc',\n",
" u'accent', u'accenture', u'accept', u'access', u'accessible',\n",
" u'accidant', u'accident', u'accidentally', u'accommodation',\n",
" u'accommodationvouchers', u'accordin', u'accordingly', u'account',\n",
" u'account's', u'accounting', u'accounts'],\n",
" dtype='object')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.columns[:50]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'yet', u'yetty's', u'yetunde', u'yhl', u'yi', u'yijue', u'ym', u'ymca',\n",
" u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'you'd',\n",
" u'you'ld', u'you'll', u'you're', u'you've', u'youdoing', u'young',\n",
" u'younger', u'your', u'your's', u'youre', u'yourinclusive', u'yourjob',\n",
" u'yours', u'yourself', u'youuuuu', u'yowifes', u'yr', u'yrs',\n",
" u'ystrday', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou',\n",
" u'yup', u'yupz', u'zac', u'zealand', u'zed', u'zhong', u'zoe',\n",
" u'zogtorius', u'zoom', u'zouk'],\n",
" dtype='object')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.columns[-50:]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a21</th>\n",
" <th>a30</th>\n",
" <th>aa</th>\n",
" <th>aah</th>\n",
" <th>aaniye</th>\n",
" <th>aaooooright</th>\n",
" <th>aathi</th>\n",
" <th>ab</th>\n",
" <th>abbey</th>\n",
" <th>abdomen</th>\n",
" <th>...</th>\n",
" <th>yup</th>\n",
" <th>yupz</th>\n",
" <th>zac</th>\n",
" <th>zealand</th>\n",
" <th>zed</th>\n",
" <th>zhong</th>\n",
" <th>zoe</th>\n",
" <th>zogtorius</th>\n",
" <th>zoom</th>\n",
" <th>zouk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1202</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3752</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5554</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 7213 columns</p>\n",
"</div>"
],
"text/plain": [
" a21 a30 aa aah aaniye aaooooright aathi ab abbey abdomen ... \\\n",
"889 0 0 0 0 0 0 0 0 0 0 ... \n",
"1202 0 0 0 0 0 0 0 0 0 0 ... \n",
"212 0 0 0 0 0 0 0 0 0 0 ... \n",
"3752 0 0 0 0 0 0 0 0 0 0 ... \n",
"5554 0 0 0 0 0 0 0 0 0 0 ... \n",
"\n",
" yup yupz zac zealand zed zhong zoe zogtorius zoom zouk \n",
"889 0 0 0 0 0 0 0 0 0 0 \n",
"1202 0 0 0 0 0 0 0 0 0 0 \n",
"212 0 0 0 0 0 0 0 0 0 0 \n",
"3752 0 0 0 0 0 0 0 0 0 0 \n",
"5554 0 0 0 0 0 0 0 0 0 0 \n",
"\n",
"[5 rows x 7213 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate Feature Probabilities (Train/Fit Model)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_conditional_probability_for_word(col):\n",
" return col.sum() / len(col)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_feature_prob(feature_matrix):\n",
" \n",
" spam_boolean_mask = (df.label == \"spam\")\n",
" ham_boolean_mask = (df.label == \"ham\")\n",
" \n",
" # Explanation for \"confusing\" syntax:\n",
" # http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" \n",
" feature_matrix_spam = feature_matrix.loc[spam_boolean_mask, :] # get all rows for spam boolean mask\n",
" feature_matrix_ham = feature_matrix.loc[ham_boolean_mask, :] # get all rows for ham boolean mask\n",
" \n",
" # mymatrix[:, 0] is to get the first column\n",
" # mymatrix[:, 1] is to get the second column\n",
" \n",
" # mymatrix[0, :] is to get the first row\n",
" # mymatrix[1, :] is to get the second row\n",
" \n",
" # mymatrix[boolean_mask, :] is to get the rows where boolean_mask is True\n",
" \n",
" feature_prob_spam = feature_matrix_spam.apply(get_conditional_probability_for_word, axis=0)\n",
" feature_prob_ham = feature_matrix_ham.apply(get_conditional_probability_for_word, axis=0)\n",
" \n",
" feature_prob = pd.concat([feature_prob_spam, feature_prob_ham], axis=1)\n",
" feature_prob.columns = ['spam', 'ham']\n",
" \n",
" return feature_prob"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"(7213, 2)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob = get_feature_prob(feature_matrix)\n",
"feature_prob.shape"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>ham</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a21</th>\n",
" <td>0.001672</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>a30</th>\n",
" <td>0.000000</td>\n",
" <td>0.000259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aa</th>\n",
" <td>0.000000</td>\n",
" <td>0.000259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aah</th>\n",
" <td>0.000000</td>\n",
" <td>0.000518</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aaniye</th>\n",
" <td>0.000000</td>\n",
" <td>0.000259</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam ham\n",
"a21 0.001672 0.000000\n",
"a30 0.000000 0.000259\n",
"aa 0.000000 0.000259\n",
"aah 0.000000 0.000518\n",
"aaniye 0.000000 0.000259"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob.head()"
]
},
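{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that many words (like `a30` above) never occur in one of the classes, so their conditional probability is exactly zero, and any message containing such a word would zero out a product of per-word probabilities at prediction time. A common remedy is Laplace (add-one) smoothing. Below is a minimal sketch of one way to apply it; the function name `get_smoothed_feature_prob` and the pseudocount `k` are introduced here for illustration and are not part of the original fit. Passing the labels in explicitly (rather than reading a global data frame) also keeps the function self-contained."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def get_smoothed_feature_prob(feature_matrix, labels, k=1.0):\n",
"    # Laplace-smoothed P(word | class) = (count + k) / (n + 2k),\n",
"    # so no estimate is ever exactly 0 or 1\n",
"    spam = feature_matrix.loc[labels == \"spam\", :]\n",
"    ham = feature_matrix.loc[labels == \"ham\", :]\n",
"    feature_prob = pd.concat([\n",
"        (spam.sum() + k) / (len(spam) + 2 * k),\n",
"        (ham.sum() + k) / (len(ham) + 2 * k),\n",
"    ], axis=1)\n",
"    feature_prob.columns = ['spam', 'ham']\n",
"    return feature_prob\n",
"\n",
"smoothed_feature_prob = get_smoothed_feature_prob(feature_matrix, train.label)\n",
"smoothed_feature_prob.head()"
]
},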
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze Feature Probabilities in Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Words with the largest conditional probability for predicting spam.\n",
"\n",
"P(w_i | y= \"spam\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>ham</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>to</th>\n",
" <td>0.625418</td>\n",
" <td>0.253693</td>\n",
" </tr>\n",
" <tr>\n",
" <th>call</th>\n",
" <td>0.431438</td>\n",
" <td>0.047162</td>\n",
" </tr>\n",
" <tr>\n",
" <th>your</th>\n",
" <td>0.322742</td>\n",
" <td>0.073594</td>\n",
" </tr>\n",
" <tr>\n",
" <th>you</th>\n",
" <td>0.311037</td>\n",
" <td>0.276237</td>\n",
" </tr>\n",
" <tr>\n",
" <th>now</th>\n",
" <td>0.252508</td>\n",
" <td>0.060897</td>\n",
" </tr>\n",
" <tr>\n",
" <th>for</th>\n",
" <td>0.244147</td>\n",
" <td>0.094325</td>\n",
" </tr>\n",
" <tr>\n",
" <th>or</th>\n",
" <td>0.242475</td>\n",
" <td>0.045608</td>\n",
" </tr>\n",
" <tr>\n",
" <th>free</th>\n",
" <td>0.227425</td>\n",
" <td>0.013993</td>\n",
" </tr>\n",
" <tr>\n",
" <th>the</th>\n",
" <td>0.219064</td>\n",
" <td>0.182172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>txt</th>\n",
" <td>0.204013</td>\n",
" <td>0.002850</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam ham\n",
"to 0.625418 0.253693\n",
"call 0.431438 0.047162\n",
"your 0.322742 0.073594\n",
"you 0.311037 0.276237\n",
"now 0.252508 0.060897\n",
"for 0.244147 0.094325\n",
"or 0.242475 0.045608\n",
"free 0.227425 0.013993\n",
"the 0.219064 0.182172\n",
"txt 0.204013 0.002850"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob.sort_values(by='spam', ascending=False).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Words with the smallest conditional probability for predicting ham.\n",
"\n",
"P(w_i | y= \"ham\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>ham</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a21</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lastest</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>largest</th>\n",
" <td>0.006689</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>large</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>landmark</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>landlines</th>\n",
" <td>0.003344</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>land</th>\n",
" <td>0.016722</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>la32wu</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>la3</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>la1</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam ham\n",
"a21 0.001672 0\n",
"lastest 0.001672 0\n",
"largest 0.006689 0\n",
"large 0.001672 0\n",
"landmark 0.001672 0\n",
"landlines 0.003344 0\n",
"land 0.016722 0\n",
"la32wu 0.001672 0\n",
"la3 0.001672 0\n",
"la1 0.001672 0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob.sort_values(by='ham', ascending=True).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Key Takeaway**: These models are trained looking only at one class at a time, so the largest conditional probabilities may end up being common stop words. However, this will occur in both classes which ends up \"cancelling out\". The stop words won't predict one way or the other. Instead, looking at the least predictive words of the opposite class - in this case the words least predictive of \"ham\" will show us highly predictive spam words."
]
},
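{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, one way to surface these (an illustrative query over the `feature_prob` table above): the words that never appear in a ham message, ranked by how often they appear in spam."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# words with zero ham probability, ranked by spam probability:\n",
"# these are the strongest spam indicators in the training set\n",
"feature_prob[feature_prob.ham == 0].sort_values(by='spam', ascending=False).head(10)"
]
},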
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1673</th>\n",
" <td>spam</td>\n",
" <td>URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050001295 from land line. Claim A21. Valid 12hrs only</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"1673 spam \n",
"\n",
" message \n",
"1673 URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050001295 from land line. Claim A21. Valid 12hrs only "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.message.str.contains(\"a21\", case=False)]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4373</th>\n",
" <td>spam</td>\n",
" <td>Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck!</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"4373 spam \n",
"\n",
" message \n",
"4373 Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck! "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.message.str.contains(\"landmark\", case=False)]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3998</th>\n",
" <td>spam</td>\n",
" <td>Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4864</th>\n",
" <td>spam</td>\n",
" <td>Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"3998 spam \n",
"4864 spam \n",
"\n",
" message \n",
"3998 Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines! \n",
"4864 Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines! "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.message.str.contains(\"landlines\", case=False)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict Test Data"
]
},
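{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the prediction step could work, assuming the smoothed probabilities from the sketch above (the names `predict_message`, `prior_spam`, and `prior_ham` are introduced here for illustration). Each class is scored with its log prior plus the sum of the log conditional probabilities of the message's in-vocabulary tokens. This simplified variant scores only the words present in a message; a full Bernoulli model would also multiply in (1 - p) for every vocabulary word that is absent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def predict_message(msg, feature_prob, prior_spam, prior_ham):\n",
"    # score = log P(class) + sum of log P(word | class) over the\n",
"    # message's tokens that appear in the training vocabulary\n",
"    tokens = [t for t in tokenize(msg) if t in feature_prob.index]\n",
"    log_spam = np.log(prior_spam) + np.log(feature_prob.loc[tokens, 'spam']).sum()\n",
"    log_ham = np.log(prior_ham) + np.log(feature_prob.loc[tokens, 'ham']).sum()\n",
"    return 'spam' if log_spam > log_ham else 'ham'\n",
"\n",
"prior_spam = (train.label == 'spam').mean()\n",
"prior_ham = (train.label == 'ham').mean()\n",
"\n",
"predictions = test.message.apply(\n",
"    predict_message, args=(smoothed_feature_prob, prior_spam, prior_ham))\n",
"(predictions == test.label).mean() # accuracy on the held-out test set"
]
},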
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# ...."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}