{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Implementation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import division # ensure that all division is float division\n",
"from __future__ import print_function # print function works properly when used with paranthesis\n",
"\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import os, sys, re\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"pd.set_option(\"display.max_colwidth\", 255)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Read in SMS Data.**\n",
"\n",
The SMS Spam Collection">
">The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It is a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.\n",
"\n",
">A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of the spam messages in these claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.\n",
"\n",
">A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university. They were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.\n",
"\n",
"\n",
"- Primary Source: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/\n",
"- Secondary: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(5572, 2)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ham</td>\n",
" <td>Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ham</td>\n",
" <td>Ok lar... Joking wif u oni...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>spam</td>\n",
" <td>Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&amp;C's apply 08452810075over18's</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ham</td>\n",
" <td>U dun say so early hor... U c already then say...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ham</td>\n",
" <td>Nah I don't think he goes to usf, he lives around here though</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"0 ham \n",
"1 ham \n",
"2 spam \n",
"3 ham \n",
"4 ham \n",
"\n",
" message \n",
"0 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... \n",
"1 Ok lar... Joking wif u oni... \n",
"2 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's \n",
"3 U dun say so early hor... U c already then say... \n",
"4 Nah I don't think he goes to usf, he lives around here though "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../data/sms.tsv\", sep=\"\\t\", names=['label', 'message'])\n",
"print(df.shape)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stratified Train Test Split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stratified means the proprtions of spam/ham in the train/test sets reflect the original dataset. You can see the percentage is about the same here."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4457, 2) (1115, 2)\n"
]
},
{
"data": {
"text/plain": [
"(0.86582903298182634, 0.86636771300448434)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cross_validation import train_test_split\n",
"train, test = train_test_split(df, test_size=0.2, stratify=df.label)\n",
"print(train.shape, test.shape)\n",
"train.label.value_counts()['ham'] / len(train), test.label.value_counts()['ham'] / len(test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create sample data frame and sample rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract two sample messages that we will use for testing in the functions below."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"spam | WELL DONE! Your 4* Costa Del Sol Holiday or £5000 await collection. Call 09050090044 Now toClaim. SAE, TCs, POBox334, Stockport, SK38xh, Cost£1.50/pm, Max10mins\n",
"ham | What's up my own oga. Left my phone at home and just saw ur messages. Hope you are good. Have a great weekend.\n"
]
}
],
"source": [
"sample_df = train.sample(2)\n",
"\n",
"sample_row1 = sample_df.iloc[0] # first row of sample_df\n",
"sample_row2 = sample_df.iloc[1] # second row of sample_df\n",
"\n",
"sample_message1 = sample_row1.message\n",
"sample_message2 = sample_row2.message\n",
"\n",
"print(sample_row1.label, \"|\", sample_message1)\n",
"print(sample_row2.label, \"|\", sample_message2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenize Message"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use http://regex101.com to come up with regular expressions."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['costa', 'now', 'max10mins', 'call', 'await', 'well', 'sol', 'collection', 'or', 'pobox334', 'cost', 'done', 'sae', 'sk38xh', 'del', 'stockport', 'holiday', 'tcs', 'your', 'toclaim', 'pm']\n",
"['and', 'great', 'own', 'are', 'just', 'my', \"what's\", 'messages', 'up', 'weekend', 'ur', 'phone', 'good', 'at', 'have', 'saw', 'home', 'you', 'oga', 'hope', 'left']\n"
]
}
],
"source": [
"def tokenize(msg):\n",
" \"\"\"\n",
" input: \"Change again... It's e one next to escalator...\"\n",
" output: [\"change\", \"again\", \"it's\", \"one\", \"next\", \"to\", \"escalator\"]\n",
" \"\"\"\n",
" msg_lowered = msg.lower()\n",
" # at least two characters long, cannot start with number\n",
" all_tokens = re.findall(r\"\\b[a-z][a-z0-9']+\\b\", msg_lowered)\n",
" return list(set(all_tokens))\n",
"\n",
"tokens1 = tokenize(sample_message1)\n",
"tokens2 = tokenize(sample_message2)\n",
"\n",
"print(tokens1)\n",
"print(tokens2)"
]
},
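{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the tokenizer on a made-up message (an illustration added here, not a message from the dataset). Single-character tokens and tokens starting with a digit are dropped, and because of `set()` the order of the returned tokens is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# made-up example: \"a\" is too short and \"2\" starts with a digit, so both are dropped\n",
"tokenize(\"Win a prize!! Text 2 win, it's free b4 8pm\")"
]
},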
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vectorize Message"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Walk through the steps of vectorizing a message outside of a function."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sample Message 1: WELL DONE! Your 4* Costa Del Sol Holiday or £5000 await collection. Call 09050090044 Now toClaim. SAE, TCs, POBox334, Stockport, SK38xh, Cost£1.50/pm, Max10mins\n",
"Tokens 1: ['costa', 'now', 'max10mins', 'call', 'await', 'well', 'sol', 'collection', 'or', 'pobox334', 'cost', 'done', 'sae', 'sk38xh', 'del', 'stockport', 'holiday', 'tcs', 'your', 'toclaim', 'pm']\n",
"Series 1:\n",
"await 1\n",
"call 1\n",
"collection 1\n",
"cost 1\n",
"costa 1\n",
"del 1\n",
"done 1\n",
"holiday 1\n",
"max10mins 1\n",
"now 1\n",
"or 1\n",
"pm 1\n",
"pobox334 1\n",
"sae 1\n",
"sk38xh 1\n",
"sol 1\n",
"stockport 1\n",
"tcs 1\n",
"toclaim 1\n",
"well 1\n",
"your 1\n",
"dtype: int64\n",
"\n",
"Sample Message 2: What's up my own oga. Left my phone at home and just saw ur messages. Hope you are good. Have a great weekend.\n",
"Tokens 2: ['and', 'great', 'own', 'are', 'just', 'my', \"what's\", 'messages', 'up', 'weekend', 'ur', 'phone', 'good', 'at', 'have', 'saw', 'home', 'you', 'oga', 'hope', 'left']\n",
"Series 2:\n",
"and 1\n",
"are 1\n",
"at 1\n",
"good 1\n",
"great 1\n",
"have 1\n",
"home 1\n",
"hope 1\n",
"just 1\n",
"left 1\n",
"messages 1\n",
"my 1\n",
"oga 1\n",
"own 1\n",
"phone 1\n",
"saw 1\n",
"up 1\n",
"ur 1\n",
"weekend 1\n",
"what's 1\n",
"you 1\n",
"dtype: int64\n",
"\n",
"Combine Series 1 and Series 2:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>and</th>\n",
" <th>are</th>\n",
" <th>at</th>\n",
" <th>await</th>\n",
" <th>call</th>\n",
" <th>collection</th>\n",
" <th>cost</th>\n",
" <th>costa</th>\n",
" <th>del</th>\n",
" <th>done</th>\n",
" <th>...</th>\n",
" <th>stockport</th>\n",
" <th>tcs</th>\n",
" <th>toclaim</th>\n",
" <th>up</th>\n",
" <th>ur</th>\n",
" <th>weekend</th>\n",
" <th>well</th>\n",
" <th>what's</th>\n",
" <th>you</th>\n",
" <th>your</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" and are at await call collection cost costa del done ... \\\n",
"0 0 0 0 1 1 1 1 1 1 1 ... \n",
"1 1 1 1 0 0 0 0 0 0 0 ... \n",
"\n",
" stockport tcs toclaim up ur weekend well what's you your \n",
"0 1 1 1 0 0 0 1 0 0 1 \n",
"1 0 0 0 1 1 1 0 1 1 0 \n",
"\n",
"[2 rows x 42 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_dict1 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}\n",
"for token in tokens1:\n",
" token_dict1[token] = 1 \n",
"series1 = pd.Series(token_dict1) # convert the dictionary into a series where the row labels are words\n",
"\n",
"# rewrite the same as above using a dict comprehension\n",
"series1 = pd.Series({token: 1 for token in tokens1})\n",
"\n",
"token_dict2 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}\n",
"for token in tokens2:\n",
" token_dict2[token] = 1 \n",
"series2 = pd.Series(token_dict2) # convert the dictionary into a series where the row labels are words\n",
"\n",
"# rewrite the same as above using a dict comprehension\n",
"series2 = pd.Series({token: 1 for token in tokens2})\n",
"\n",
"print(\"Sample Message 1:\", sample_message1)\n",
"print(\"Tokens 1:\", tokens1)\n",
"print(\"Series 1:\")\n",
"print(series1)\n",
"print()\n",
"print(\"Sample Message 2:\", sample_message2)\n",
"print(\"Tokens 2:\", tokens2)\n",
"print(\"Series 2:\")\n",
"print(series2)\n",
"print()\n",
"\n",
"print(\"Combine Series 1 and Series 2:\")\n",
"df2 = pd.DataFrame([series1, series2]) # comebine the two \n",
"df2.fillna(0, inplace=True)\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Repeat the same process as above of tokenzing and then vectorizing using a function."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def vectorize_row(row):\n",
" \"\"\"\n",
" input: row in data frame with a \".message\" attribute\n",
" output: vectorized row where the row labels are words and the values are 1 for each row\n",
" \"\"\"\n",
" message = row.message\n",
" tokens = tokenize(message)\n",
" vectorized_row = pd.Series({token: 1 for token in tokens})\n",
" return vectorized_row"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"await 1\n",
"call 1\n",
"collection 1\n",
"cost 1\n",
"costa 1\n",
"del 1\n",
"done 1\n",
"holiday 1\n",
"max10mins 1\n",
"now 1\n",
"or 1\n",
"pm 1\n",
"pobox334 1\n",
"sae 1\n",
"sk38xh 1\n",
"sol 1\n",
"stockport 1\n",
"tcs 1\n",
"toclaim 1\n",
"well 1\n",
"your 1\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorize_row(sample_row1)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"and 1\n",
"are 1\n",
"at 1\n",
"good 1\n",
"great 1\n",
"have 1\n",
"home 1\n",
"hope 1\n",
"just 1\n",
"left 1\n",
"messages 1\n",
"my 1\n",
"oga 1\n",
"own 1\n",
"phone 1\n",
"saw 1\n",
"up 1\n",
"ur 1\n",
"weekend 1\n",
"what's 1\n",
"you 1\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorize_row(sample_row2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Feature Matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is input to our Naive Bayes model."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_feature_matrix(df):\n",
" feature_matrix = df.apply(vectorize_row, axis=1)\n",
" feature_matrix.fillna(0, inplace=True)\n",
" return feature_matrix"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>and</th>\n",
" <th>are</th>\n",
" <th>at</th>\n",
" <th>await</th>\n",
" <th>call</th>\n",
" <th>collection</th>\n",
" <th>cost</th>\n",
" <th>costa</th>\n",
" <th>del</th>\n",
" <th>done</th>\n",
" <th>...</th>\n",
" <th>stockport</th>\n",
" <th>tcs</th>\n",
" <th>toclaim</th>\n",
" <th>up</th>\n",
" <th>ur</th>\n",
" <th>weekend</th>\n",
" <th>well</th>\n",
" <th>what's</th>\n",
" <th>you</th>\n",
" <th>your</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1942</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4809</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" and are at await call collection cost costa del done ... \\\n",
"1942 0 0 0 1 1 1 1 1 1 1 ... \n",
"4809 1 1 1 0 0 0 0 0 0 0 ... \n",
"\n",
" stockport tcs toclaim up ur weekend well what's you your \n",
"1942 1 1 1 0 0 0 1 0 0 1 \n",
"4809 0 0 0 1 1 1 0 1 1 0 \n",
"\n",
"[2 rows x 42 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_feature_matrix(sample_df)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(4457, 7213)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix = get_feature_matrix(train)\n",
"feature_matrix.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'a21', u'a30', u'aa', u'aah', u'aaniye', u'aaooooright', u'aathi',\n",
" u'ab', u'abbey', u'abdomen', u'abeg', u'abel', u'aberdeen', u'abi',\n",
" u'ability', u'abiola', u'abj', u'able', u'about', u'aboutas', u'above',\n",
" u'abroad', u'absence', u'absolutely', u'absolutly', u'abstract', u'abt',\n",
" u'abta', u'aburo', u'abuse', u'abusers', u'ac', u'academic', u'acc',\n",
" u'accent', u'accenture', u'accept', u'access', u'accessible',\n",
" u'accidant', u'accident', u'accidentally', u'accommodation',\n",
" u'accommodationvouchers', u'accordin', u'accordingly', u'account',\n",
" u'account's', u'accounting', u'accounts'],\n",
" dtype='object')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.columns[:50]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index([u'yet', u'yetty's', u'yetunde', u'yhl', u'yi', u'yijue', u'ym', u'ymca',\n",
" u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'you'd',\n",
" u'you'ld', u'you'll', u'you're', u'you've', u'youdoing', u'young',\n",
" u'younger', u'your', u'your's', u'youre', u'yourinclusive', u'yourjob',\n",
" u'yours', u'yourself', u'youuuuu', u'yowifes', u'yr', u'yrs',\n",
" u'ystrday', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou',\n",
" u'yup', u'yupz', u'zac', u'zealand', u'zed', u'zhong', u'zoe',\n",
" u'zogtorius', u'zoom', u'zouk'],\n",
" dtype='object')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.columns[-50:]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a21</th>\n",
" <th>a30</th>\n",
" <th>aa</th>\n",
" <th>aah</th>\n",
" <th>aaniye</th>\n",
" <th>aaooooright</th>\n",
" <th>aathi</th>\n",
" <th>ab</th>\n",
" <th>abbey</th>\n",
" <th>abdomen</th>\n",
" <th>...</th>\n",
" <th>yup</th>\n",
" <th>yupz</th>\n",
" <th>zac</th>\n",
" <th>zealand</th>\n",
" <th>zed</th>\n",
" <th>zhong</th>\n",
" <th>zoe</th>\n",
" <th>zogtorius</th>\n",
" <th>zoom</th>\n",
" <th>zouk</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1202</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3752</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5554</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 7213 columns</p>\n",
"</div>"
],
"text/plain": [
" a21 a30 aa aah aaniye aaooooright aathi ab abbey abdomen ... \\\n",
"889 0 0 0 0 0 0 0 0 0 0 ... \n",
"1202 0 0 0 0 0 0 0 0 0 0 ... \n",
"212 0 0 0 0 0 0 0 0 0 0 ... \n",
"3752 0 0 0 0 0 0 0 0 0 0 ... \n",
"5554 0 0 0 0 0 0 0 0 0 0 ... \n",
"\n",
" yup yupz zac zealand zed zhong zoe zogtorius zoom zouk \n",
"889 0 0 0 0 0 0 0 0 0 0 \n",
"1202 0 0 0 0 0 0 0 0 0 0 \n",
"212 0 0 0 0 0 0 0 0 0 0 \n",
"3752 0 0 0 0 0 0 0 0 0 0 \n",
"5554 0 0 0 0 0 0 0 0 0 0 \n",
"\n",
"[5 rows x 7213 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate Feature Probabilities (Train/Fit Model)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_conditional_probability_for_word(col):\n",
" return col.sum() / len(col)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_feature_prob(feature_matrix):\n",
" \n",
" spam_boolean_mask = (df.label == \"spam\")\n",
" ham_boolean_mask = (df.label == \"ham\")\n",
" \n",
" # Explanation for \"confusing\" syntax:\n",
" # http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" \n",
" feature_matrix_spam = feature_matrix.loc[spam_boolean_mask, :] # get all rows for spam boolean mask\n",
" feature_matrix_ham = feature_matrix.loc[ham_boolean_mask, :] # get all rows for ham boolean mask\n",
" \n",
" # mymatrix[:, 0] is to get the first column\n",
" # mymatrix[:, 1] is to get the second column\n",
" \n",
" # mymatrix[0, :] is to get the first row\n",
" # mymatrix[1, :] is to get the second row\n",
" \n",
" # mymatrix[boolean_mask, :] is to get the rows where boolean_mask is True\n",
" \n",
" feature_prob_spam = feature_matrix_spam.apply(get_conditional_probability_for_word, axis=0)\n",
" feature_prob_ham = feature_matrix_ham.apply(get_conditional_probability_for_word, axis=0)\n",
" \n",
" feature_prob = pd.concat([feature_prob_spam, feature_prob_ham], axis=1)\n",
" feature_prob.columns = ['spam', 'ham']\n",
" \n",
" return feature_prob"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"(7213, 2)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob = get_feature_prob(feature_matrix)\n",
"feature_prob.shape"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>ham</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a21</th>\n",
" <td>0.001672</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>a30</th>\n",
" <td>0.000000</td>\n",
" <td>0.000259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aa</th>\n",
" <td>0.000000</td>\n",
" <td>0.000259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aah</th>\n",
" <td>0.000000</td>\n",
" <td>0.000518</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aaniye</th>\n",
" <td>0.000000</td>\n",
" <td>0.000259</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam ham\n",
"a21 0.001672 0.000000\n",
"a30 0.000000 0.000259\n",
"aa 0.000000 0.000259\n",
"aah 0.000000 0.000518\n",
"aaniye 0.000000 0.000259"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob.head()"
]
},
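{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that many words (like `a30` above) never occur in one of the classes, so their conditional probability is exactly zero, and any message containing such a word would zero out a product of per-word probabilities at prediction time. A common remedy is Laplace (add-one) smoothing. Below is a minimal sketch of one way to apply it; the function name `get_smoothed_feature_prob` and the pseudocount `k` are introduced here for illustration and are not part of the original fit. Passing the labels in explicitly (rather than reading a global data frame) also keeps the function self-contained."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def get_smoothed_feature_prob(feature_matrix, labels, k=1.0):\n",
"    # Laplace-smoothed P(word | class) = (count + k) / (n + 2k),\n",
"    # so no estimate is ever exactly 0 or 1\n",
"    spam = feature_matrix.loc[labels == \"spam\", :]\n",
"    ham = feature_matrix.loc[labels == \"ham\", :]\n",
"    feature_prob = pd.concat([\n",
"        (spam.sum() + k) / (len(spam) + 2 * k),\n",
"        (ham.sum() + k) / (len(ham) + 2 * k),\n",
"    ], axis=1)\n",
"    feature_prob.columns = ['spam', 'ham']\n",
"    return feature_prob\n",
"\n",
"smoothed_feature_prob = get_smoothed_feature_prob(feature_matrix, train.label)\n",
"smoothed_feature_prob.head()"
]
},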
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze Feature Probabilities in Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Words with the largest conditional probability for predicting spam.\n",
"\n",
"P(w_i | y= \"spam\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>ham</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>to</th>\n",
" <td>0.625418</td>\n",
" <td>0.253693</td>\n",
" </tr>\n",
" <tr>\n",
" <th>call</th>\n",
" <td>0.431438</td>\n",
" <td>0.047162</td>\n",
" </tr>\n",
" <tr>\n",
" <th>your</th>\n",
" <td>0.322742</td>\n",
" <td>0.073594</td>\n",
" </tr>\n",
" <tr>\n",
" <th>you</th>\n",
" <td>0.311037</td>\n",
" <td>0.276237</td>\n",
" </tr>\n",
" <tr>\n",
" <th>now</th>\n",
" <td>0.252508</td>\n",
" <td>0.060897</td>\n",
" </tr>\n",
" <tr>\n",
" <th>for</th>\n",
" <td>0.244147</td>\n",
" <td>0.094325</td>\n",
" </tr>\n",
" <tr>\n",
" <th>or</th>\n",
" <td>0.242475</td>\n",
" <td>0.045608</td>\n",
" </tr>\n",
" <tr>\n",
" <th>free</th>\n",
" <td>0.227425</td>\n",
" <td>0.013993</td>\n",
" </tr>\n",
" <tr>\n",
" <th>the</th>\n",
" <td>0.219064</td>\n",
" <td>0.182172</td>\n",
" </tr>\n",
" <tr>\n",
" <th>txt</th>\n",
" <td>0.204013</td>\n",
" <td>0.002850</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam ham\n",
"to 0.625418 0.253693\n",
"call 0.431438 0.047162\n",
"your 0.322742 0.073594\n",
"you 0.311037 0.276237\n",
"now 0.252508 0.060897\n",
"for 0.244147 0.094325\n",
"or 0.242475 0.045608\n",
"free 0.227425 0.013993\n",
"the 0.219064 0.182172\n",
"txt 0.204013 0.002850"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob.sort_values(by='spam', ascending=False).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Words with the smallest conditional probability for predicting ham.\n",
"\n",
"P(w_i | y= \"ham\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>ham</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>a21</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lastest</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>largest</th>\n",
" <td>0.006689</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>large</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>landmark</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>landlines</th>\n",
" <td>0.003344</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>land</th>\n",
" <td>0.016722</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>la32wu</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>la3</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>la1</th>\n",
" <td>0.001672</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam ham\n",
"a21 0.001672 0\n",
"lastest 0.001672 0\n",
"largest 0.006689 0\n",
"large 0.001672 0\n",
"landmark 0.001672 0\n",
"landlines 0.003344 0\n",
"land 0.016722 0\n",
"la32wu 0.001672 0\n",
"la3 0.001672 0\n",
"la1 0.001672 0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_prob.sort_values(by='ham', ascending=True).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Key Takeaway**: These models are trained looking only at one class at a time, so the largest conditional probabilities may end up being common stop words. However, this will occur in both classes which ends up \"cancelling out\". The stop words won't predict one way or the other. Instead, looking at the least predictive words of the opposite class - in this case the words least predictive of \"ham\" will show us highly predictive spam words."
]
},
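{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, one way to surface these (an illustrative query over the `feature_prob` table above): the words that never appear in a ham message, ranked by how often they appear in spam."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# words with zero ham probability, ranked by spam probability:\n",
"# these are the strongest spam indicators in the training set\n",
"feature_prob[feature_prob.ham == 0].sort_values(by='spam', ascending=False).head(10)"
]
},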
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1673</th>\n",
" <td>spam</td>\n",
" <td>URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050001295 from land line. Claim A21. Valid 12hrs only</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"1673 spam \n",
"\n",
" message \n",
"1673 URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050001295 from land line. Claim A21. Valid 12hrs only "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.message.str.contains(\"a21\", case=False)]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4373</th>\n",
" <td>spam</td>\n",
" <td>Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck!</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"4373 spam \n",
"\n",
" message \n",
"4373 Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck! "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.message.str.contains(\"landmark\", case=False)]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3998</th>\n",
" <td>spam</td>\n",
" <td>Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4864</th>\n",
" <td>spam</td>\n",
" <td>Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label \\\n",
"3998 spam \n",
"4864 spam \n",
"\n",
" message \n",
"3998 Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines! \n",
"4864 Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines! "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.message.str.contains(\"landlines\", case=False)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict Test Data"
]
},
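{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the prediction step could work, assuming the smoothed probabilities from the sketch above (the names `predict_message`, `prior_spam`, and `prior_ham` are introduced here for illustration). Each class is scored with its log prior plus the sum of the log conditional probabilities of the message's in-vocabulary tokens. This simplified variant scores only the words present in a message; a full Bernoulli model would also multiply in (1 - p) for every vocabulary word that is absent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def predict_message(msg, feature_prob, prior_spam, prior_ham):\n",
"    # score = log P(class) + sum of log P(word | class) over the\n",
"    # message's tokens that appear in the training vocabulary\n",
"    tokens = [t for t in tokenize(msg) if t in feature_prob.index]\n",
"    log_spam = np.log(prior_spam) + np.log(feature_prob.loc[tokens, 'spam']).sum()\n",
"    log_ham = np.log(prior_ham) + np.log(feature_prob.loc[tokens, 'ham']).sum()\n",
"    return 'spam' if log_spam > log_ham else 'ham'\n",
"\n",
"prior_spam = (train.label == 'spam').mean()\n",
"prior_ham = (train.label == 'ham').mean()\n",
"\n",
"predictions = test.message.apply(\n",
"    predict_message, args=(smoothed_feature_prob, prior_spam, prior_ham))\n",
"(predictions == test.label).mean() # accuracy on the held-out test set"
]
},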
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# ...."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}