vijayanandrp · December 6, 2017 00:33
diff --git a/age_analysis.ipynb b/age_analysis.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Data wrangling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data wrangling, sometimes referred to as **data munging**, is the process of transforming and mapping data from one \"raw\" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)\n",
    "\n",
    "Wrangler is an interactive tool for data cleaning and transformation.\n",
    "Spend less time formatting and more time analyzing your data. [stanford](http://vis.stanford.edu/wrangler/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### Example - 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 0 - Requirement"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I was given a data problem where I have to write a model to auto-clean database values without manual work. This was my first practical ML solution delivered to my client.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  1. Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env python3.5\n",
    "# encoding: utf-8\n",
    "\n",
    "import random\n",
    "import csv\n",
    "from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier\n",
    "\n",
    "age_file = 'age.csv'\n",
    "training_percent = 0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analysing the dataset before processing. I was given a column of actual values their corresponding correction values. I have planned to use the same solution similar to name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "age_df = pd.read_csv(age_file, header=None, usecols=[1,2])\n",
    "age_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(480, 2)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>480</td>\n",
       "      <td>480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>480</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>36years old</td>\n",
       "      <td>30 to 34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>1</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             actual correction\n",
       "count           480        480\n",
       "unique          480         14\n",
       "top     36years old   30 to 34\n",
       "freq              1         51"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18 to 20</td>\n",
       "      <td>18 to 20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>18</td>\n",
       "      <td>18 to 20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18 - 20</td>\n",
       "      <td>18 to 20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>18 - 21</td>\n",
       "      <td>18 to 20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>18 - 22</td>\n",
       "      <td>18 to 20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     actual correction\n",
       "0  18 to 20   18 to 20\n",
       "1        18   18 to 20\n",
       "2   18 - 20   18 to 20\n",
       "3   18 - 21   18 to 20\n",
       "4   18 - 22   18 to 20"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>475</th>\n",
       "      <td>?? ??</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>476</th>\n",
       "      <td>?? ??? ????</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>477</th>\n",
       "      <td>???? ??? ????</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>478</th>\n",
       "      <td>SHL Bureau 7</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>479</th>\n",
       "      <td>SMT6</td>\n",
       "      <td>?</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            actual correction\n",
       "475          ?? ??          ?\n",
       "476    ?? ??? ????          ?\n",
       "477  ???? ??? ????          ?\n",
       "478   SHL Bureau 7          ?\n",
       "479           SMT6          ?"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>435</th>\n",
       "      <td>78</td>\n",
       "      <td>65+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>284</th>\n",
       "      <td>47 years</td>\n",
       "      <td>45 to 49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>261</th>\n",
       "      <td>45</td>\n",
       "      <td>45 to 49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>22 years</td>\n",
       "      <td>21 to 24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>31</td>\n",
       "      <td>30 to 34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>109</th>\n",
       "      <td>28years old</td>\n",
       "      <td>25 to 29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>461</th>\n",
       "      <td>Jonger dan 18</td>\n",
       "      <td>Under 18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>129</th>\n",
       "      <td>30 a 34</td>\n",
       "      <td>30 to 34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>187</th>\n",
       "      <td>36 - 40</td>\n",
       "      <td>35 to 39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>108</th>\n",
       "      <td>28Years</td>\n",
       "      <td>25 to 29</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            actual correction\n",
       "435             78        65+\n",
       "284       47 years   45 to 49\n",
       "261             45   45 to 49\n",
       "46        22 years   21 to 24\n",
       "140             31   30 to 34\n",
       "109    28years old   25 to 29\n",
       "461  Jonger dan 18   Under 18\n",
       "129        30 a 34   30 to 34\n",
       "187        36 - 40   35 to 39\n",
       "108        28Years   25 to 29"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_df.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['18 to 20', '21 to 24', '25 to 29', '30 to 34', '35 to 39',\n",
       "       '40 to 44', '45 to 49', '50 to 54', '55 to 59', '60 to 64', '65+',\n",
       "       'Declined to Respond', 'Under 18', '?'], dtype=object)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_df['correction'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Making feature matrix  X "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def feature_extraction(_data):\n",
    "    \"\"\" This function is used to extract features in a given data value\"\"\"\n",
    "    # Find the digits in the given string Example - data='18-20' digits = '1820'\n",
    "    digits = str(''.join(c for c in _data if c.isdigit()))\n",
    "    # calculate the length of the string\n",
    "    len_digits = len(digits)\n",
    "    # splitting digits in to values example - digits = '1820' ages = [18, 20]\n",
    "    ages = [int(digits[i:i + 2]) for i in range(0, len_digits, 2)]\n",
    "    # checking for special character in the given data\n",
    "    special_character = '.+-<>?'\n",
    "    spl_char = ''.join([c for c in list(special_character) if c in _data])\n",
    "    # handling decimal age data\n",
    "    if len_digits == 3:\n",
    "        spl_char = '.'\n",
    "        age = \"\".join([str(ages[0]), '.', str(ages[1])])\n",
    "        # normalizing\n",
    "        age = int(float(age) - 0.5)\n",
    "        ages = [age]\n",
    "    # Finding the maximum, minimum, average age values\n",
    "    max_age = 0\n",
    "    min_age = 0\n",
    "    mean_age = 0\n",
    "    if len(ages):\n",
    "        max_age = max(ages)\n",
    "        min_age = min(ages)\n",
    "    if len(ages) == 2:\n",
    "        mean_age = int((max_age + min_age) / 2)\n",
    "    else:\n",
    "        mean_age = max_age\n",
    "    # specially added for 18 years cases\n",
    "    only_18 = 0\n",
    "    is_y = 0\n",
    "    if ages == [18]:\n",
    "        only_18 = 1\n",
    "        if 'y' in _data or 'Y' in _data:\n",
    "            is_y = 1\n",
    "    under_18 = 0\n",
    "    if 1 < max_age < 18:\n",
    "        under_18 = 1\n",
    "    above_65 = 0\n",
    "    if mean_age >= 65:\n",
    "        above_65 = 1\n",
    "    # verifying whether digit is found in the given string or not.\n",
    "    # Example - data='18-20' digits_found=True data='????' digits_found=False\n",
    "    digits_found = 1\n",
    "    if len_digits == 1:\n",
    "        digits_found = 1\n",
    "        max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = 0, 0, 0, 0, 0, 0, 0\n",
    "    elif len_digits == 0:\n",
    "        digits_found, max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = -1, -1, -1, -1, -1, -1, -1, -1\n",
    "     \n",
    "    feature = {\n",
    "        'ages': tuple(ages),\n",
    "        'len(ages)': len(ages),\n",
    "        'spl_chr': spl_char,\n",
    "        'is_digit': digits_found,\n",
    "        'max_age': max_age,\n",
    "        'mean_age': mean_age,\n",
    "        'only_18': only_18,\n",
    "        'is_y': is_y,\n",
    "        'above_65': above_65,\n",
    "        'under_18': under_18\n",
    "    }\n",
    "\n",
    "    return feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Loading dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = []\n",
    "with open(age_file, newline='\\n') as fp:\n",
    "    input_data = csv.reader(fp, delimiter=',')\n",
    "    for row in input_data:\n",
    "        dataset.append((row[1:]))\n",
    "feature_sets = [(actual, correction) for (actual, correction) in dataset]\n",
    "random.shuffle(feature_sets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### creating feature matrix X and response vector y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Visualizing Feature Matrix X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_val = [val[0]  for val in feature_sets]\n",
    "feature_df = pd.DataFrame(feature_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(480, 10)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>above_65</th>\n",
       "      <th>ages</th>\n",
       "      <th>is_digit</th>\n",
       "      <th>is_y</th>\n",
       "      <th>len(ages)</th>\n",
       "      <th>max_age</th>\n",
       "      <th>mean_age</th>\n",
       "      <th>only_18</th>\n",
       "      <th>spl_chr</th>\n",
       "      <th>under_18</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>432</th>\n",
       "      <td>0</td>\n",
       "      <td>(63,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>63</td>\n",
       "      <td>63</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>154</th>\n",
       "      <td>0</td>\n",
       "      <td>(32,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>32</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>430</th>\n",
       "      <td>0</td>\n",
       "      <td>(58,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>58</td>\n",
       "      <td>58</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>407</th>\n",
       "      <td>0</td>\n",
       "      <td>(34,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>34</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>65</th>\n",
       "      <td>0</td>\n",
       "      <td>(34,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>34</td>\n",
       "      <td>34</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>462</th>\n",
       "      <td>1</td>\n",
       "      <td>(65,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>65</td>\n",
       "      <td>65</td>\n",
       "      <td>0</td>\n",
       "      <td>&gt;</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>1</td>\n",
       "      <td>(65,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>65</td>\n",
       "      <td>65</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>356</th>\n",
       "      <td>-1</td>\n",
       "      <td>()</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>?</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180</th>\n",
       "      <td>0</td>\n",
       "      <td>(20, 24)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>24</td>\n",
       "      <td>22</td>\n",
       "      <td>0</td>\n",
       "      <td>-</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>451</th>\n",
       "      <td>1</td>\n",
       "      <td>(66,)</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>66</td>\n",
       "      <td>66</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     above_65      ages  is_digit  is_y  len(ages)  max_age  mean_age  \\\n",
       "432         0     (63,)         1     0          1       63        63   \n",
       "154         0     (32,)         1     0          1       32        32   \n",
       "430         0     (58,)         1     0          1       58        58   \n",
       "407         0     (34,)         1     0          1       34        34   \n",
       "65          0     (34,)         1     0          1       34        34   \n",
       "462         1     (65,)         1     0          1       65        65   \n",
       "97          1     (65,)         1     0          1       65        65   \n",
       "356        -1        ()        -1    -1          0       -1        -1   \n",
       "180         0  (20, 24)         1     0          2       24        22   \n",
       "451         1     (66,)         1     0          1       66        66   \n",
       "\n",
       "     only_18 spl_chr  under_18  \n",
       "432        0                 0  \n",
       "154        0                 0  \n",
       "430        0                 0  \n",
       "407        0                 0  \n",
       "65         0                 0  \n",
       "462        0       >         0  \n",
       "97         0                 0  \n",
       "356       -1       ?        -1  \n",
       "180        0       -         0  \n",
       "451        0                 0  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_df.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Train Test Split "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "cut_point = int(len(feature_sets) * training_percent)\n",
    "train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### NaiveBayes Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "nb_classifier = NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of NaiveBayesClassifier: 0.9583333333333334 \n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy of NaiveBayesClassifier: {} \".format(classify.accuracy(nb_classifier, test_set)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "                 max_age = 65                65+ : 60 to  =     10.4 : 1.0\n",
      "                 max_age = 59             55 to  : 50 to  =      7.2 : 1.0\n",
      "               len(ages) = 2              60 to  : Under  =      6.3 : 1.0\n",
      "                 spl_chr = ''                65+ : ?      =      5.8 : 1.0\n",
      "                 max_age = 39             35 to  : 30 to  =      5.7 : 1.0\n",
      "                 only_18 = 0              30 to  : ?      =      4.9 : 1.0\n",
      "                under_18 = 0              30 to  : ?      =      4.9 : 1.0\n",
      "                    is_y = 0              30 to  : ?      =      4.9 : 1.0\n",
      "                above_65 = 0              30 to  : ?      =      4.9 : 1.0\n",
      "               len(ages) = 1                 65+ : ?      =      4.9 : 1.0\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "print(nb_classifier.show_most_informative_features(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Maxent Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ==> Training (100 iterations)\n",
      "\n",
      "      Iteration    Log Likelihood    Accuracy\n",
      "      ---------------------------------------\n",
      "             1          -2.63906        0.055\n",
      "             2          -1.71509        0.930\n",
      "             3          -1.30567        0.964\n",
      "             4          -1.03310        0.964\n",
      "             5          -0.84445        0.966\n",
      "             6          -0.70872        0.992\n",
      "             7          -0.60766        0.992\n",
      "             8          -0.53016        0.992\n",
      "             9          -0.46919        0.992\n",
      "            10          -0.42015        0.992\n",
      "            11          -0.37998        0.992\n",
      "            12          -0.34653        0.992\n",
      "            13          -0.31829        0.992\n",
      "            14          -0.29417        1.000\n",
      "            15          -0.27333        1.000\n",
      "            16          -0.25517        1.000\n",
      "            17          -0.23921        1.000\n",
      "            18          -0.22508        1.000\n",
      "            19          -0.21249        1.000\n",
      "            20          -0.20120        1.000\n",
      "            21          -0.19103        1.000\n",
      "            22          -0.18182        1.000\n",
      "            23          -0.17343        1.000\n",
      "            24          -0.16577        1.000\n",
      "            25          -0.15875        1.000\n",
      "            26          -0.15229        1.000\n",
      "            27          -0.14633        1.000\n",
      "            28          -0.14080        1.000\n",
      "            29          -0.13568        1.000\n",
      "            30          -0.13091        1.000\n",
      "            31          -0.12646        1.000\n",
      "            32          -0.12229        1.000\n",
      "            33          -0.11839        1.000\n",
      "            34          -0.11473        1.000\n",
      "            35          -0.11128        1.000\n",
      "            36          -0.10804        1.000\n",
      "            37          -0.10497        1.000\n",
      "            38          -0.10208        1.000\n",
      "            39          -0.09933        1.000\n",
      "            40          -0.09673        1.000\n",
      "            41          -0.09426        1.000\n",
      "            42          -0.09191        1.000\n",
      "            43          -0.08968        1.000\n",
      "            44          -0.08755        1.000\n",
      "            45          -0.08552        1.000\n",
      "            46          -0.08358        1.000\n",
      "            47          -0.08172        1.000\n",
      "            48          -0.07995        1.000\n",
      "            49          -0.07825        1.000\n",
      "            50          -0.07662        1.000\n",
      "            51          -0.07506        1.000\n",
      "            52          -0.07356        1.000\n",
      "            53          -0.07211        1.000\n",
      "            54          -0.07073        1.000\n",
      "            55          -0.06939        1.000\n",
      "            56          -0.06810        1.000\n",
      "            57          -0.06686        1.000\n",
      "            58          -0.06567        1.000\n",
      "            59          -0.06451        1.000\n",
      "            60          -0.06340        1.000\n",
      "            61          -0.06232        1.000\n",
      "            62          -0.06128        1.000\n",
      "            63          -0.06028        1.000\n",
      "            64          -0.05930        1.000\n",
      "            65          -0.05836        1.000\n",
      "            66          -0.05744        1.000\n",
      "            67          -0.05656        1.000\n",
      "            68          -0.05570        1.000\n",
      "            69          -0.05487        1.000\n",
      "            70          -0.05406        1.000\n",
      "            71          -0.05327        1.000\n",
      "            72          -0.05251        1.000\n",
      "            73          -0.05177        1.000\n",
      "            74          -0.05105        1.000\n",
      "            75          -0.05034        1.000\n",
      "            76          -0.04966        1.000\n",
      "            77          -0.04900        1.000\n",
      "            78          -0.04835        1.000\n",
      "            79          -0.04772        1.000\n",
      "            80          -0.04711        1.000\n",
      "            81          -0.04651        1.000\n",
      "            82          -0.04593        1.000\n",
      "            83          -0.04536        1.000\n",
      "            84          -0.04480        1.000\n",
      "            85          -0.04426        1.000\n",
      "            86          -0.04373        1.000\n",
      "            87          -0.04322        1.000\n",
      "            88          -0.04271        1.000\n",
      "            89          -0.04222        1.000\n",
      "            90          -0.04174        1.000\n",
      "            91          -0.04127        1.000\n",
      "            92          -0.04081        1.000\n",
      "            93          -0.04036        1.000\n",
      "            94          -0.03992        1.000\n",
      "            95          -0.03949        1.000\n",
      "            96          -0.03907        1.000\n",
      "            97          -0.03865        1.000\n",
      "            98          -0.03825        1.000\n",
      "            99          -0.03785        1.000\n",
      "         Final          -0.03747        1.000\n"
     ]
    }
   ],
   "source": [
    "max_classifier = MaxentClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of MaxentClassifier: 0.9791666666666666 \n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy of MaxentClassifier: {} \".format(classify.accuracy(max_classifier, test_set)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   6.505 is_y==1 and label is '18 to 20'\n",
      "  -6.145 spl_chr=='' and label is '?'\n",
      "   4.921 ages==(7,) and label is '?'\n",
      "   4.921 mean_age==0 and label is '?'\n",
      "   4.921 max_age==0 and label is '?'\n",
      "   4.000 ages==(61, 65) and label is '60 to 64'\n",
      "   4.000 mean_age==63 and label is '60 to 64'\n",
      "   3.910 ages==(18, 21) and label is '18 to 20'\n",
      "   3.883 ages==(50, 59) and label is '50 to 54'\n",
      "   3.845 ages==(56, 60) and label is '55 to 59'\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "print(max_classifier.show_most_informative_features(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Decision Tree Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "decision_classifier = DecisionTreeClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of DecisionTreeClassifier: 0.9166666666666666 \n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy of DecisionTreeClassifier: {} \".format(classify.accuracy(decision_classifier, test_set)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter q (or) quit to end this test module\n",
      "\n",
      "Enter data for testing: q\n",
      "End\n"
     ]
    }
   ],
   "source": [
    "print('Enter q (or) quit to end this test module')\n",
    "while 1:\n",
    "    data = input('\\nEnter data for testing: ')\n",
    "    if data.lower() == 'q' or data.lower() == 'quit':\n",
    "        print('End')\n",
    "        break\n",
    "\n",
    "    if not len(data):\n",
    "        continue\n",
    "\n",
    "    features = feature_extraction(data)\n",
    "    print(features)\n",
    "    prediction = [nb_classifier.classify(features),\n",
    "                  max_classifier.classify(features),\n",
    "                  decision_classifier.classify(features)]\n",
    "\n",
    "    print('NaiveBayes Classifier     : ', prediction[0])\n",
    "    print('Maxent Classifier         : ', prediction[1])\n",
    "    print('Decision Tree Classifier  : ', prediction[2])\n",
    "    print('-'*75)\n",
    "    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
No results found