vijayanandrp · December 7, 2017 10:23
diff --git a/gender_analysis.ipynb b/gender_analysis.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Data wrangling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data wrangling, sometimes referred to as **data munging**, is the process of transforming and mapping data from one \"raw\" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)\n",
    "\n",
    "Wrangler is an interactive tool for data cleaning and transformation.\n",
    "Spend less time formatting and more time analyzing your data. [stanford](http://vis.stanford.edu/wrangler/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### Example - 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 0 - Requirement"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I was given a data problem where I have to write a model to auto-clean database values without manual work. This was my first practical ML solution delivered to my client.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  1. Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env python3.5\n",
    "# encoding: utf-8\n",
    "\n",
    "import random\n",
    "import csv\n",
    "from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier\n",
    "\n",
    "gender_file = 'gender.csv'\n",
    "training_percent = 0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analysing the dataset before processing. I was given a column of actual values their corresponding correction values. I have planned to use the same solution similar to name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "gender_df = pd.read_csv(gender_file, header=None, usecols=[1,2])\n",
    "gender_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(137, 2)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gender_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>137</td>\n",
       "      <td>137</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>137</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>???? ??? ???????</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>1</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  actual                  correction\n",
       "count                137                         137\n",
       "unique               137                           3\n",
       "top     ???? ??? ???????  Other/Prefer Not To Answer\n",
       "freq                   1                          73"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gender_df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Female</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>F</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Female 1</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Female male</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Female1</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        actual correction\n",
       "0       Female     Female\n",
       "1            F     Female\n",
       "2     Female 1     Female\n",
       "3  Female male     Female\n",
       "4      Female1     Female"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gender_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>132</th>\n",
       "      <td>Wole nie podawac</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>133</th>\n",
       "      <td>Επιλέξτε</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>134</th>\n",
       "      <td>선택</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>135</th>\n",
       "      <td>选择</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>136</th>\n",
       "      <td>選擇</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               actual                  correction\n",
       "132  Wole nie podawac  Other/Prefer Not To Answer\n",
       "133          Επιλέξτε  Other/Prefer Not To Answer\n",
       "134                선택  Other/Prefer Not To Answer\n",
       "135                选择  Other/Prefer Not To Answer\n",
       "136                選擇  Other/Prefer Not To Answer"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gender_df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>actual</th>\n",
       "      <th>correction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Mezczyzna</td>\n",
       "      <td>Male</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Ms</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>125</th>\n",
       "      <td>test1</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>94</th>\n",
       "      <td>Karlkyns</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>74</th>\n",
       "      <td>?????</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>女</td>\n",
       "      <td>Female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>Nespecificat</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75</th>\n",
       "      <td>???????</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>113</th>\n",
       "      <td>Prefer Not To Answer</td>\n",
       "      <td>Other/Prefer Not To Answer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>Άνδρας</td>\n",
       "      <td>Male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   actual                  correction\n",
       "53              Mezczyzna                        Male\n",
       "20                     Ms                      Female\n",
       "125                 test1  Other/Prefer Not To Answer\n",
       "94               Karlkyns  Other/Prefer Not To Answer\n",
       "74                  ?????  Other/Prefer Not To Answer\n",
       "33                      女                      Female\n",
       "99           Nespecificat  Other/Prefer Not To Answer\n",
       "75                ???????  Other/Prefer Not To Answer\n",
       "113  Prefer Not To Answer  Other/Prefer Not To Answer\n",
       "61                 Άνδρας                        Male"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gender_df.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Female', 'Male', 'Other/Prefer Not To Answer'], dtype=object)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gender_df['correction'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Solution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Making feature matrix  X "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def feature_extraction(_data):\n",
    "    \"\"\" This function is used to extract features in a given data value\"\"\"\n",
    "    _data = _data.lower()\n",
    "    f_1, f_2, f_3, f_4, l_1, l_2, l_3, l_4 = None, None, None, None, None, None, None ,None\n",
    "    \n",
    "    # extracting first and last 4 characters\n",
    "    if len(_data) >= 4:\n",
    "        f_4 = _data[:4]\n",
    "        l_4 = _data[-4:]\n",
    "    # extracting first and last 3 characters\n",
    "    if len(_data) >= 3:\n",
    "        f_3 = _data[:3]\n",
    "        l_3 = _data[-3:]\n",
    "    # extracting first and last 2 characters\n",
    "    if len(_data) >= 2:\n",
    "        f_2 = _data[:2]\n",
    "        l_2 = _data[-2:]\n",
    "    # extracting first and last 1 character\n",
    "    if len(_data) >= 1:\n",
    "        f_1 = _data[:1]\n",
    "        l_1 = _data[-1:]\n",
    "    \n",
    "    feature = {\n",
    "        'f_1': f_1,\n",
    "        'f_2': f_2,\n",
    "        'l_1': l_1,\n",
    "        'l_2': l_2,\n",
    "        'f_3': f_3,\n",
    "        'f_4': f_4,\n",
    "        'l_3': l_3,\n",
    "        'l_4': l_4\n",
    "    }\n",
    "\n",
    "    return feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Loading dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = []\n",
    "with open(gender_file, newline='\\n') as fp:\n",
    "    input_data = csv.reader(fp, delimiter=',')\n",
    "    for row in input_data:\n",
    "        dataset.append((row[1:]))\n",
    "feature_sets = [(actual, correction) for (actual, correction) in dataset]\n",
    "random.shuffle(feature_sets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### creating feature matrix X and response vector y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Visualizing Feature Matrix X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_val = [val[0]  for val in feature_sets]\n",
    "feature_df = pd.DataFrame(feature_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(137, 8)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>f_1</th>\n",
       "      <th>f_2</th>\n",
       "      <th>f_3</th>\n",
       "      <th>f_4</th>\n",
       "      <th>l_1</th>\n",
       "      <th>l_2</th>\n",
       "      <th>l_3</th>\n",
       "      <th>l_4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>122</th>\n",
       "      <td>v</td>\n",
       "      <td>ve</td>\n",
       "      <td>vel</td>\n",
       "      <td>velg</td>\n",
       "      <td>g</td>\n",
       "      <td>lg</td>\n",
       "      <td>elg</td>\n",
       "      <td>velg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>m</td>\n",
       "      <td>ma</td>\n",
       "      <td>mal</td>\n",
       "      <td>male</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>e 1</td>\n",
       "      <td>le 1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>b</td>\n",
       "      <td>be</td>\n",
       "      <td>bez</td>\n",
       "      <td>bez</td>\n",
       "      <td>a</td>\n",
       "      <td>ra</td>\n",
       "      <td>ora</td>\n",
       "      <td>vora</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>e</td>\n",
       "      <td>er</td>\n",
       "      <td>err</td>\n",
       "      <td>erre</td>\n",
       "      <td>k</td>\n",
       "      <td>ék</td>\n",
       "      <td>nék</td>\n",
       "      <td>lnék</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>w</td>\n",
       "      <td>wa</td>\n",
       "      <td>wan</td>\n",
       "      <td>wani</td>\n",
       "      <td>a</td>\n",
       "      <td>ta</td>\n",
       "      <td>ita</td>\n",
       "      <td>nita</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>ž</td>\n",
       "      <td>že</td>\n",
       "      <td>žen</td>\n",
       "      <td>žens</td>\n",
       "      <td>ý</td>\n",
       "      <td>ký</td>\n",
       "      <td>ský</td>\n",
       "      <td>nský</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91</th>\n",
       "      <td>f</td>\n",
       "      <td>fe</td>\n",
       "      <td>fem</td>\n",
       "      <td>femm</td>\n",
       "      <td>e</td>\n",
       "      <td>me</td>\n",
       "      <td>mme</td>\n",
       "      <td>emme</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>?</td>\n",
       "      <td>??</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>?</td>\n",
       "      <td>??</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>?</td>\n",
       "      <td>??</td>\n",
       "      <td>??</td>\n",
       "      <td>?? ?</td>\n",
       "      <td>?</td>\n",
       "      <td>??</td>\n",
       "      <td>???</td>\n",
       "      <td>????</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>119</th>\n",
       "      <td>m</td>\n",
       "      <td>mr</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>r</td>\n",
       "      <td>mr</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    f_1 f_2   f_3   f_4 l_1 l_2   l_3   l_4\n",
       "122   v  ve   vel  velg   g  lg   elg  velg\n",
       "46    m  ma   mal  male   1   1   e 1  le 1\n",
       "49    b  be   bez  bez    a  ra   ora  vora\n",
       "100   e  er   err  erre   k  ék   nék  lnék\n",
       "59    w  wa   wan  wani   a  ta   ita  nita\n",
       "124   ž  že   žen  žens   ý  ký   ský  nský\n",
       "91    f  fe   fem  femm   e  me   mme  emme\n",
       "42    ?  ??  None  None   ?  ??  None  None\n",
       "34    ?  ??   ??   ?? ?   ?  ??   ???  ????\n",
       "119   m  mr  None  None   r  mr  None  None"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_df.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Train Test Split "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "cut_point = int(len(feature_sets) * training_percent)\n",
    "train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### NaiveBayes Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "nb_classifier = NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of NaiveBayesClassifier: 0.75 \n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy of NaiveBayesClassifier: {} \".format(classify.accuracy(nb_classifier, test_set)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "                     f_1 = 'm'              Male : Other/ =     20.9 : 1.0\n",
      "                     f_1 = 'f'            Female : Male   =      6.5 : 1.0\n",
      "                     l_1 = 'a'            Female : Other/ =      6.3 : 1.0\n",
      "                     f_1 = 'k'            Female : Other/ =      3.9 : 1.0\n",
      "                     f_1 = 'p'            Other/ : Male   =      3.5 : 1.0\n",
      "                     f_2 = 'kv'           Female : Other/ =      3.3 : 1.0\n",
      "                     l_1 = 'k'              Male : Other/ =      2.8 : 1.0\n",
      "                     l_1 = '1'              Male : Other/ =      2.8 : 1.0\n",
      "                     l_1 = 'i'              Male : Other/ =      2.8 : 1.0\n",
      "                     f_1 = 'n'            Other/ : Female =      2.8 : 1.0\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "print(nb_classifier.show_most_informative_features(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Maxent Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ==> Training (100 iterations)\n",
      "\n",
      "      Iteration    Log Likelihood    Accuracy\n",
      "      ---------------------------------------\n",
      "             1          -1.09861        0.495\n",
      "             2          -0.55492        0.954\n",
      "             3          -0.39256        0.991\n",
      "             4          -0.30684        1.000\n",
      "             5          -0.25292        1.000\n",
      "             6          -0.21564        1.000\n",
      "             7          -0.18823        1.000\n",
      "             8          -0.16719        1.000\n",
      "             9          -0.15051        1.000\n",
      "            10          -0.13694        1.000\n",
      "            11          -0.12568        1.000\n",
      "            92          -0.01723        1.000\n",
      "            93          -0.01705        1.000\n",
      "            94          -0.01688        1.000\n",
      "            95          -0.01671        1.000\n",
      "            96          -0.01654        1.000\n",
      "            97          -0.01637        1.000\n",
      "            98          -0.01621        1.000\n",
      "            99          -0.01605        1.000\n",
      "         Final          -0.01590        1.000\n"
     ]
    }
   ],
   "source": [
    "max_classifier = MaxentClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of MaxentClassifier: 0.75 \n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy of MaxentClassifier: {} \".format(classify.accuracy(max_classifier, test_set)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  -4.701 f_1=='m' and label is 'Other/Prefer Not To Answer'\n",
      "   3.138 l_1=='f' and label is 'Female'\n",
      "   3.132 l_1=='男' and label is 'Male'\n",
      "   3.132 f_1=='男' and label is 'Male'\n",
      "  -2.761 f_1=='m' and label is 'Female'\n",
      "   2.704 l_1=='女' and label is 'Female'\n",
      "   2.704 f_1=='女' and label is 'Female'\n",
      "   2.640 l_2=='nő' and label is 'Female'\n",
      "   2.640 l_1=='ő' and label is 'Female'\n",
      "   2.640 f_2=='nő' and label is 'Female'\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "print(max_classifier.show_most_informative_features(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Decision Tree Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "decision_classifier = DecisionTreeClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of DecisionTreeClassifier: 0.6428571428571429 \n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy of DecisionTreeClassifier: {} \".format(classify.accuracy(decision_classifier, test_set)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Enter q (or) quit to end this test module\n",
      "\n",
      "Enter data for testing: M\n",
      "{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'm', 'f_2': None, 'f_1': 'm', 'l_3': None}\n",
      "NaiveBayes Classifier     :  Male\n",
      "Maxent Classifier         :  Male\n",
      "Decision Tree Classifier  :  Male\n",
      "---------------------------------------------------------------------------\n",
      "(Best of 3) =               Male\n",
      "\n",
      "Enter data for testing: F\n",
      "{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'f', 'f_2': None, 'f_1': 'f', 'l_3': None}\n",
      "NaiveBayes Classifier     :  Female\n",
      "Maxent Classifier         :  Female\n",
      "Decision Tree Classifier  :  Female\n",
      "---------------------------------------------------------------------------\n",
      "(Best of 3) =               Female\n",
      "\n",
      "Enter data for testing: female\n",
      "{'f_3': 'fem', 'l_4': 'male', 'f_4': 'fema', 'l_2': 'le', 'l_1': 'e', 'f_2': 'fe', 'f_1': 'f', 'l_3': 'ale'}\n",
      "NaiveBayes Classifier     :  Female\n",
      "Maxent Classifier         :  Female\n",
      "Decision Tree Classifier  :  Female\n",
      "---------------------------------------------------------------------------\n",
      "(Best of 3) =               Female\n",
      "\n",
      "Enter data for testing: MMMMM\n",
      "{'f_3': 'mmm', 'l_4': 'mmmm', 'f_4': 'mmmm', 'l_2': 'mm', 'l_1': 'm', 'f_2': 'mm', 'f_1': 'm', 'l_3': 'mmm'}\n",
      "NaiveBayes Classifier     :  Male\n",
      "Maxent Classifier         :  Male\n",
      "Decision Tree Classifier  :  Female\n",
      "---------------------------------------------------------------------------\n",
      "(Best of 3) =               Male\n",
      "\n",
      "Enter data for testing: FFFFF\n",
      "{'f_3': 'fff', 'l_4': 'ffff', 'f_4': 'ffff', 'l_2': 'ff', 'l_1': 'f', 'f_2': 'ff', 'f_1': 'f', 'l_3': 'fff'}\n",
      "NaiveBayes Classifier     :  Female\n",
      "Maxent Classifier         :  Female\n",
      "Decision Tree Classifier  :  Female\n",
      "---------------------------------------------------------------------------\n",
      "(Best of 3) =               Female\n"
     ]
    }
   ],
   "source": [
    "print('Enter q (or) quit to end this test module')\n",
    "while 1:\n",
    "    data = input('\\nEnter data for testing: ')\n",
    "    if data.lower() == 'q' or data.lower() == 'quit':\n",
    "        print('End')\n",
    "        break\n",
    "\n",
    "    if not len(data):\n",
    "        continue\n",
    "\n",
    "    features = feature_extraction(data)\n",
    "    print(features)\n",
    "    prediction = [nb_classifier.classify(features),\n",
    "                  max_classifier.classify(features),\n",
    "                  decision_classifier.classify(features)]\n",
    "\n",
    "    print('NaiveBayes Classifier     : ', prediction[0])\n",
    "    print('Maxent Classifier         : ', prediction[1])\n",
    "    print('Decision Tree Classifier  : ', prediction[2])\n",
    "    print('-'*75)\n",
    "    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Data wrangling"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one \"raw\" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)\n",
	"\n",
	"Wrangler is an interactive tool for data cleaning and transformation.\n",
	"Spend less time formatting and more time analyzing your data. [stanford](http://vis.stanford.edu/wrangler/)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	" ### Example - 1"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 0 - Requirement"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"I was given a data problem where I have to write a model to auto-clean database values without manual work. This was my first practical ML solution delivered to my client. "
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 1. Analysis"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"#!/usr/bin/env python3.5\n",
	"# encoding: utf-8\n",
	"\n",
	"import random\n",
	"import csv\n",
	"from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier\n",
	"\n",
	"gender_file = 'gender.csv'\n",
	"training_percent = 0.8"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Analysing the dataset before processing. I was given a column of actual values their corresponding correction values. I have planned to use the same solution similar to name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"import pandas as pd\n",
	"gender_df = pd.read_csv(gender_file, header=None, usecols=[1,2])\n",
	"gender_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"(137, 2)"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"gender_df.shape"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>actual</th>\n",
	" <th>correction</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>count</th>\n",
	" <td>137</td>\n",
	" <td>137</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>unique</th>\n",
	" <td>137</td>\n",
	" <td>3</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>top</th>\n",
	" <td>???? ??? ???????</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>freq</th>\n",
	" <td>1</td>\n",
	" <td>73</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" actual correction\n",
	"count 137 137\n",
	"unique 137 3\n",
	"top ???? ??? ??????? Other/Prefer Not To Answer\n",
	"freq 1 73"
	]
	},
	"execution_count": 5,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"gender_df.describe()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>actual</th>\n",
	" <th>correction</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>Female</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>F</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>Female 1</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>Female male</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>Female1</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" actual correction\n",
	"0 Female Female\n",
	"1 F Female\n",
	"2 Female 1 Female\n",
	"3 Female male Female\n",
	"4 Female1 Female"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"gender_df.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>actual</th>\n",
	" <th>correction</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>132</th>\n",
	" <td>Wole nie podawac</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>133</th>\n",
	" <td>Επιλέξτε</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>134</th>\n",
	" <td>선택</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>135</th>\n",
	" <td>选择</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>136</th>\n",
	" <td>選擇</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" actual correction\n",
	"132 Wole nie podawac Other/Prefer Not To Answer\n",
	"133 Επιλέξτε Other/Prefer Not To Answer\n",
	"134 선택 Other/Prefer Not To Answer\n",
	"135 选择 Other/Prefer Not To Answer\n",
	"136 選擇 Other/Prefer Not To Answer"
	]
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"gender_df.tail()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>actual</th>\n",
	" <th>correction</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>53</th>\n",
	" <td>Mezczyzna</td>\n",
	" <td>Male</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>20</th>\n",
	" <td>Ms</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>125</th>\n",
	" <td>test1</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>94</th>\n",
	" <td>Karlkyns</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>74</th>\n",
	" <td>?????</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>33</th>\n",
	" <td>女</td>\n",
	" <td>Female</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>99</th>\n",
	" <td>Nespecificat</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>75</th>\n",
	" <td>???????</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>113</th>\n",
	" <td>Prefer Not To Answer</td>\n",
	" <td>Other/Prefer Not To Answer</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>61</th>\n",
	" <td>Άνδρας</td>\n",
	" <td>Male</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" actual correction\n",
	"53 Mezczyzna Male\n",
	"20 Ms Female\n",
	"125 test1 Other/Prefer Not To Answer\n",
	"94 Karlkyns Other/Prefer Not To Answer\n",
	"74 ????? Other/Prefer Not To Answer\n",
	"33 女 Female\n",
	"99 Nespecificat Other/Prefer Not To Answer\n",
	"75 ??????? Other/Prefer Not To Answer\n",
	"113 Prefer Not To Answer Other/Prefer Not To Answer\n",
	"61 Άνδρας Male"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"gender_df.sample(10)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"array(['Female', 'Male', 'Other/Prefer Not To Answer'], dtype=object)"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"gender_df['correction'].unique()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 2. Solution"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Making feature matrix X "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 15,
	"metadata": {},
	"outputs": [],
	"source": [
	"def feature_extraction(_data):\n",
	" \"\"\" This function is used to extract features in a given data value\"\"\"\n",
	" _data = _data.lower()\n",
	" f_1, f_2, f_3, f_4, l_1, l_2, l_3, l_4 = None, None, None, None, None, None, None ,None\n",
	" \n",
	" # extracting first and last 4 characters\n",
	" if len(_data) >= 4:\n",
	" f_4 = _data[:4]\n",
	" l_4 = _data[-4:]\n",
	" # extracting first and last 3 characters\n",
	" if len(_data) >= 3:\n",
	" f_3 = _data[:3]\n",
	" l_3 = _data[-3:]\n",
	" # extracting first and last 2 characters\n",
	" if len(_data) >= 2:\n",
	" f_2 = _data[:2]\n",
	" l_2 = _data[-2:]\n",
	" # extracting first and last 1 character\n",
	" if len(_data) >= 1:\n",
	" f_1 = _data[:1]\n",
	" l_1 = _data[-1:]\n",
	" \n",
	" feature = {\n",
	" 'f_1': f_1,\n",
	" 'f_2': f_2,\n",
	" 'l_1': l_1,\n",
	" 'l_2': l_2,\n",
	" 'f_3': f_3,\n",
	" 'f_4': f_4,\n",
	" 'l_3': l_3,\n",
	" 'l_4': l_4\n",
	" }\n",
	"\n",
	" return feature"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Loading dataset"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 16,
	"metadata": {},
	"outputs": [],
	"source": [
	"dataset = []\n",
	"with open(gender_file, newline='\\n') as fp:\n",
	" input_data = csv.reader(fp, delimiter=',')\n",
	" for row in input_data:\n",
	" dataset.append((row[1:]))\n",
	"feature_sets = [(actual, correction) for (actual, correction) in dataset]\n",
	"random.shuffle(feature_sets)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### creating feature matrix X and response vector y"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 17,
	"metadata": {},
	"outputs": [],
	"source": [
	"feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Visualizing Feature Matrix X"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 18,
	"metadata": {},
	"outputs": [],
	"source": [
	"feature_val = [val[0] for val in feature_sets]\n",
	"feature_df = pd.DataFrame(feature_val)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 19,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"(137, 8)"
	]
	},
	"execution_count": 19,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"feature_df.shape"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 20,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>f_1</th>\n",
	" <th>f_2</th>\n",
	" <th>f_3</th>\n",
	" <th>f_4</th>\n",
	" <th>l_1</th>\n",
	" <th>l_2</th>\n",
	" <th>l_3</th>\n",
	" <th>l_4</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>122</th>\n",
	" <td>v</td>\n",
	" <td>ve</td>\n",
	" <td>vel</td>\n",
	" <td>velg</td>\n",
	" <td>g</td>\n",
	" <td>lg</td>\n",
	" <td>elg</td>\n",
	" <td>velg</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>46</th>\n",
	" <td>m</td>\n",
	" <td>ma</td>\n",
	" <td>mal</td>\n",
	" <td>male</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>e 1</td>\n",
	" <td>le 1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>49</th>\n",
	" <td>b</td>\n",
	" <td>be</td>\n",
	" <td>bez</td>\n",
	" <td>bez</td>\n",
	" <td>a</td>\n",
	" <td>ra</td>\n",
	" <td>ora</td>\n",
	" <td>vora</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>100</th>\n",
	" <td>e</td>\n",
	" <td>er</td>\n",
	" <td>err</td>\n",
	" <td>erre</td>\n",
	" <td>k</td>\n",
	" <td>ék</td>\n",
	" <td>nék</td>\n",
	" <td>lnék</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>59</th>\n",
	" <td>w</td>\n",
	" <td>wa</td>\n",
	" <td>wan</td>\n",
	" <td>wani</td>\n",
	" <td>a</td>\n",
	" <td>ta</td>\n",
	" <td>ita</td>\n",
	" <td>nita</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>124</th>\n",
	" <td>ž</td>\n",
	" <td>že</td>\n",
	" <td>žen</td>\n",
	" <td>žens</td>\n",
	" <td>ý</td>\n",
	" <td>ký</td>\n",
	" <td>ský</td>\n",
	" <td>nský</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>91</th>\n",
	" <td>f</td>\n",
	" <td>fe</td>\n",
	" <td>fem</td>\n",
	" <td>femm</td>\n",
	" <td>e</td>\n",
	" <td>me</td>\n",
	" <td>mme</td>\n",
	" <td>emme</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>42</th>\n",
	" <td>?</td>\n",
	" <td>??</td>\n",
	" <td>None</td>\n",
	" <td>None</td>\n",
	" <td>?</td>\n",
	" <td>??</td>\n",
	" <td>None</td>\n",
	" <td>None</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>34</th>\n",
	" <td>?</td>\n",
	" <td>??</td>\n",
	" <td>??</td>\n",
	" <td>?? ?</td>\n",
	" <td>?</td>\n",
	" <td>??</td>\n",
	" <td>???</td>\n",
	" <td>????</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>119</th>\n",
	" <td>m</td>\n",
	" <td>mr</td>\n",
	" <td>None</td>\n",
	" <td>None</td>\n",
	" <td>r</td>\n",
	" <td>mr</td>\n",
	" <td>None</td>\n",
	" <td>None</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" f_1 f_2 f_3 f_4 l_1 l_2 l_3 l_4\n",
	"122 v ve vel velg g lg elg velg\n",
	"46 m ma mal male 1 1 e 1 le 1\n",
	"49 b be bez bez a ra ora vora\n",
	"100 e er err erre k ék nék lnék\n",
	"59 w wa wan wani a ta ita nita\n",
	"124 ž že žen žens ý ký ský nský\n",
	"91 f fe fem femm e me mme emme\n",
	"42 ? ?? None None ? ?? None None\n",
	"34 ? ?? ?? ?? ? ? ?? ??? ????\n",
	"119 m mr None None r mr None None"
	]
	},
	"execution_count": 20,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"feature_df.sample(10)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Train Test Split "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"metadata": {},
	"outputs": [],
	"source": [
	"cut_point = int(len(feature_sets) * training_percent)\n",
	"train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### NaiveBayes Classifier"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 22,
	"metadata": {},
	"outputs": [],
	"source": [
	"nb_classifier = NaiveBayesClassifier.train(train_set)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 23,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Accuracy of NaiveBayesClassifier: 0.75 \n"
	]
	}
	],
	"source": [
	"print(\"Accuracy of NaiveBayesClassifier: {} \".format(classify.accuracy(nb_classifier, test_set)))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 24,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Most Informative Features\n",
	" f_1 = 'm' Male : Other/ = 20.9 : 1.0\n",
	" f_1 = 'f' Female : Male = 6.5 : 1.0\n",
	" l_1 = 'a' Female : Other/ = 6.3 : 1.0\n",
	" f_1 = 'k' Female : Other/ = 3.9 : 1.0\n",
	" f_1 = 'p' Other/ : Male = 3.5 : 1.0\n",
	" f_2 = 'kv' Female : Other/ = 3.3 : 1.0\n",
	" l_1 = 'k' Male : Other/ = 2.8 : 1.0\n",
	" l_1 = '1' Male : Other/ = 2.8 : 1.0\n",
	" l_1 = 'i' Male : Other/ = 2.8 : 1.0\n",
	" f_1 = 'n' Other/ : Female = 2.8 : 1.0\n",
	"None\n"
	]
	}
	],
	"source": [
	"print(nb_classifier.show_most_informative_features(10))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Maxent Classifier"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 26,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" ==> Training (100 iterations)\n",
	"\n",
	" Iteration Log Likelihood Accuracy\n",
	" ---------------------------------------\n",
	" 1 -1.09861 0.495\n",
	" 2 -0.55492 0.954\n",
	" 3 -0.39256 0.991\n",
	" 4 -0.30684 1.000\n",
	" 5 -0.25292 1.000\n",
	" 6 -0.21564 1.000\n",
	" 7 -0.18823 1.000\n",
	" 8 -0.16719 1.000\n",
	" 9 -0.15051 1.000\n",
	" 10 -0.13694 1.000\n",
	" 11 -0.12568 1.000\n",
	" 92 -0.01723 1.000\n",
	" 93 -0.01705 1.000\n",
	" 94 -0.01688 1.000\n",
	" 95 -0.01671 1.000\n",
	" 96 -0.01654 1.000\n",
	" 97 -0.01637 1.000\n",
	" 98 -0.01621 1.000\n",
	" 99 -0.01605 1.000\n",
	" Final -0.01590 1.000\n"
	]
	}
	],
	"source": [
	"max_classifier = MaxentClassifier.train(train_set)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 27,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Accuracy of MaxentClassifier: 0.75 \n"
	]
	}
	],
	"source": [
	"print(\"Accuracy of MaxentClassifier: {} \".format(classify.accuracy(max_classifier, test_set)))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 28,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" -4.701 f_1=='m' and label is 'Other/Prefer Not To Answer'\n",
	" 3.138 l_1=='f' and label is 'Female'\n",
	" 3.132 l_1=='男' and label is 'Male'\n",
	" 3.132 f_1=='男' and label is 'Male'\n",
	" -2.761 f_1=='m' and label is 'Female'\n",
	" 2.704 l_1=='女' and label is 'Female'\n",
	" 2.704 f_1=='女' and label is 'Female'\n",
	" 2.640 l_2=='nő' and label is 'Female'\n",
	" 2.640 l_1=='ő' and label is 'Female'\n",
	" 2.640 f_2=='nő' and label is 'Female'\n",
	"None\n"
	]
	}
	],
	"source": [
	"print(max_classifier.show_most_informative_features(10))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"##### Decision Tree Classifier"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 29,
	"metadata": {},
	"outputs": [],
	"source": [
	"decision_classifier = DecisionTreeClassifier.train(train_set)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 30,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Accuracy of DecisionTreeClassifier: 0.6428571428571429 \n"
	]
	}
	],
	"source": [
	"print(\"Accuracy of DecisionTreeClassifier: {} \".format(classify.accuracy(decision_classifier, test_set)))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### 4. Evaluation"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Enter q (or) quit to end this test module\n",
	"\n",
	"Enter data for testing: M\n",
	"{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'm', 'f_2': None, 'f_1': 'm', 'l_3': None}\n",
	"NaiveBayes Classifier : Male\n",
	"Maxent Classifier : Male\n",
	"Decision Tree Classifier : Male\n",
	"---------------------------------------------------------------------------\n",
	"(Best of 3) = Male\n",
	"\n",
	"Enter data for testing: F\n",
	"{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'f', 'f_2': None, 'f_1': 'f', 'l_3': None}\n",
	"NaiveBayes Classifier : Female\n",
	"Maxent Classifier : Female\n",
	"Decision Tree Classifier : Female\n",
	"---------------------------------------------------------------------------\n",
	"(Best of 3) = Female\n",
	"\n",
	"Enter data for testing: female\n",
	"{'f_3': 'fem', 'l_4': 'male', 'f_4': 'fema', 'l_2': 'le', 'l_1': 'e', 'f_2': 'fe', 'f_1': 'f', 'l_3': 'ale'}\n",
	"NaiveBayes Classifier : Female\n",
	"Maxent Classifier : Female\n",
	"Decision Tree Classifier : Female\n",
	"---------------------------------------------------------------------------\n",
	"(Best of 3) = Female\n",
	"\n",
	"Enter data for testing: MMMMM\n",
	"{'f_3': 'mmm', 'l_4': 'mmmm', 'f_4': 'mmmm', 'l_2': 'mm', 'l_1': 'm', 'f_2': 'mm', 'f_1': 'm', 'l_3': 'mmm'}\n",
	"NaiveBayes Classifier : Male\n",
	"Maxent Classifier : Male\n",
	"Decision Tree Classifier : Female\n",
	"---------------------------------------------------------------------------\n",
	"(Best of 3) = Male\n",
	"\n",
	"Enter data for testing: FFFFF\n",
	"{'f_3': 'fff', 'l_4': 'ffff', 'f_4': 'ffff', 'l_2': 'ff', 'l_1': 'f', 'f_2': 'ff', 'f_1': 'f', 'l_3': 'fff'}\n",
	"NaiveBayes Classifier : Female\n",
	"Maxent Classifier : Female\n",
	"Decision Tree Classifier : Female\n",
	"---------------------------------------------------------------------------\n",
	"(Best of 3) = Female\n"
	]
	}
	],
	"source": [
	"print('Enter q (or) quit to end this test module')\n",
	"while 1:\n",
	" data = input('\\nEnter data for testing: ')\n",
	" if data.lower() == 'q' or data.lower() == 'quit':\n",
	" print('End')\n",
	" break\n",
	"\n",
	" if not len(data):\n",
	" continue\n",
	"\n",
	" features = feature_extraction(data)\n",
	" print(features)\n",
	" prediction = [nb_classifier.classify(features),\n",
	" max_classifier.classify(features),\n",
	" decision_classifier.classify(features)]\n",
	"\n",
	" print('NaiveBayes Classifier : ', prediction[0])\n",
	" print('Maxent Classifier : ', prediction[1])\n",
	" print('Decision Tree Classifier : ', prediction[2])\n",
	" print('-'*75)\n",
	" print('(Best of 3) = ', max(set(prediction), key=prediction.count))"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.5.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}