Last active
December 7, 2017 10:23
-
-
Save vijayanandrp/581577444746fad5abbf263f11c71146 to your computer and use it in GitHub Desktop.
Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-2/
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Data wrangling" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Data wrangling, sometimes referred to as **data munging**, is the process of transforming and mapping data from one \"raw\" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)\n", | |
"\n", | |
"Wrangler is an interactive tool for data cleaning and transformation.\n", | |
"Spend less time formatting and more time analyzing your data. [stanford](http://vis.stanford.edu/wrangler/)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" ### Example - 1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 0 - Requirement" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I was given a data problem where I have to write a model to auto-clean database values without manual work. This was my first practical ML solution delivered to my client. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 1. Analysis" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"#!/usr/bin/env python3.5\n", | |
"# encoding: utf-8\n", | |
"\n", | |
"import random\n", | |
"import csv\n", | |
"from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier\n", | |
"\n", | |
"gender_file = 'gender.csv'\n", | |
"training_percent = 0.8" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Analysing the dataset before processing. I was given a column of actual values their corresponding correction values. I have planned to use the same solution similar to name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"gender_df = pd.read_csv(gender_file, header=None, usecols=[1,2])\n", | |
"gender_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(137, 2)" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"gender_df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>actual</th>\n", | |
" <th>correction</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>137</td>\n", | |
" <td>137</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>unique</th>\n", | |
" <td>137</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>top</th>\n", | |
" <td>???? ??? ???????</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>freq</th>\n", | |
" <td>1</td>\n", | |
" <td>73</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" actual correction\n", | |
"count 137 137\n", | |
"unique 137 3\n", | |
"top ???? ??? ??????? Other/Prefer Not To Answer\n", | |
"freq 1 73" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"gender_df.describe()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>actual</th>\n", | |
" <th>correction</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Female</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>F</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Female 1</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Female male</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Female1</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" actual correction\n", | |
"0 Female Female\n", | |
"1 F Female\n", | |
"2 Female 1 Female\n", | |
"3 Female male Female\n", | |
"4 Female1 Female" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"gender_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>actual</th>\n", | |
" <th>correction</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>132</th>\n", | |
" <td>Wole nie podawac</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>133</th>\n", | |
" <td>Επιλέξτε</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>134</th>\n", | |
" <td>선택</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>135</th>\n", | |
" <td>选择</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>136</th>\n", | |
" <td>選擇</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" actual correction\n", | |
"132 Wole nie podawac Other/Prefer Not To Answer\n", | |
"133 Επιλέξτε Other/Prefer Not To Answer\n", | |
"134 선택 Other/Prefer Not To Answer\n", | |
"135 选择 Other/Prefer Not To Answer\n", | |
"136 選擇 Other/Prefer Not To Answer" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"gender_df.tail()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>actual</th>\n", | |
" <th>correction</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>53</th>\n", | |
" <td>Mezczyzna</td>\n", | |
" <td>Male</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>20</th>\n", | |
" <td>Ms</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>125</th>\n", | |
" <td>test1</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>94</th>\n", | |
" <td>Karlkyns</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>74</th>\n", | |
" <td>?????</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>33</th>\n", | |
" <td>女</td>\n", | |
" <td>Female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99</th>\n", | |
" <td>Nespecificat</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75</th>\n", | |
" <td>???????</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>113</th>\n", | |
" <td>Prefer Not To Answer</td>\n", | |
" <td>Other/Prefer Not To Answer</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>61</th>\n", | |
" <td>Άνδρας</td>\n", | |
" <td>Male</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" actual correction\n", | |
"53 Mezczyzna Male\n", | |
"20 Ms Female\n", | |
"125 test1 Other/Prefer Not To Answer\n", | |
"94 Karlkyns Other/Prefer Not To Answer\n", | |
"74 ????? Other/Prefer Not To Answer\n", | |
"33 女 Female\n", | |
"99 Nespecificat Other/Prefer Not To Answer\n", | |
"75 ??????? Other/Prefer Not To Answer\n", | |
"113 Prefer Not To Answer Other/Prefer Not To Answer\n", | |
"61 Άνδρας Male" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"gender_df.sample(10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array(['Female', 'Male', 'Other/Prefer Not To Answer'], dtype=object)" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"gender_df['correction'].unique()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 2. Solution" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Making feature matrix X " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def feature_extraction(_data):\n", | |
" \"\"\" This function is used to extract features in a given data value\"\"\"\n", | |
" _data = _data.lower()\n", | |
" f_1, f_2, f_3, f_4, l_1, l_2, l_3, l_4 = None, None, None, None, None, None, None ,None\n", | |
" \n", | |
" # extracting first and last 4 characters\n", | |
" if len(_data) >= 4:\n", | |
" f_4 = _data[:4]\n", | |
" l_4 = _data[-4:]\n", | |
" # extracting first and last 3 characters\n", | |
" if len(_data) >= 3:\n", | |
" f_3 = _data[:3]\n", | |
" l_3 = _data[-3:]\n", | |
" # extracting first and last 2 characters\n", | |
" if len(_data) >= 2:\n", | |
" f_2 = _data[:2]\n", | |
" l_2 = _data[-2:]\n", | |
" # extracting first and last 1 character\n", | |
" if len(_data) >= 1:\n", | |
" f_1 = _data[:1]\n", | |
" l_1 = _data[-1:]\n", | |
" \n", | |
" feature = {\n", | |
" 'f_1': f_1,\n", | |
" 'f_2': f_2,\n", | |
" 'l_1': l_1,\n", | |
" 'l_2': l_2,\n", | |
" 'f_3': f_3,\n", | |
" 'f_4': f_4,\n", | |
" 'l_3': l_3,\n", | |
" 'l_4': l_4\n", | |
" }\n", | |
"\n", | |
" return feature" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Loading dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"dataset = []\n", | |
"with open(gender_file, newline='\\n') as fp:\n", | |
" input_data = csv.reader(fp, delimiter=',')\n", | |
" for row in input_data:\n", | |
" dataset.append((row[1:]))\n", | |
"feature_sets = [(actual, correction) for (actual, correction) in dataset]\n", | |
"random.shuffle(feature_sets)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### creating feature matrix X and response vector y" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Visualizing Feature Matrix X" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"feature_val = [val[0] for val in feature_sets]\n", | |
"feature_df = pd.DataFrame(feature_val)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(137, 8)" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"feature_df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>f_1</th>\n", | |
" <th>f_2</th>\n", | |
" <th>f_3</th>\n", | |
" <th>f_4</th>\n", | |
" <th>l_1</th>\n", | |
" <th>l_2</th>\n", | |
" <th>l_3</th>\n", | |
" <th>l_4</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>122</th>\n", | |
" <td>v</td>\n", | |
" <td>ve</td>\n", | |
" <td>vel</td>\n", | |
" <td>velg</td>\n", | |
" <td>g</td>\n", | |
" <td>lg</td>\n", | |
" <td>elg</td>\n", | |
" <td>velg</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>46</th>\n", | |
" <td>m</td>\n", | |
" <td>ma</td>\n", | |
" <td>mal</td>\n", | |
" <td>male</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>e 1</td>\n", | |
" <td>le 1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>49</th>\n", | |
" <td>b</td>\n", | |
" <td>be</td>\n", | |
" <td>bez</td>\n", | |
" <td>bez</td>\n", | |
" <td>a</td>\n", | |
" <td>ra</td>\n", | |
" <td>ora</td>\n", | |
" <td>vora</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>100</th>\n", | |
" <td>e</td>\n", | |
" <td>er</td>\n", | |
" <td>err</td>\n", | |
" <td>erre</td>\n", | |
" <td>k</td>\n", | |
" <td>ék</td>\n", | |
" <td>nék</td>\n", | |
" <td>lnék</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>59</th>\n", | |
" <td>w</td>\n", | |
" <td>wa</td>\n", | |
" <td>wan</td>\n", | |
" <td>wani</td>\n", | |
" <td>a</td>\n", | |
" <td>ta</td>\n", | |
" <td>ita</td>\n", | |
" <td>nita</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>124</th>\n", | |
" <td>ž</td>\n", | |
" <td>že</td>\n", | |
" <td>žen</td>\n", | |
" <td>žens</td>\n", | |
" <td>ý</td>\n", | |
" <td>ký</td>\n", | |
" <td>ský</td>\n", | |
" <td>nský</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>91</th>\n", | |
" <td>f</td>\n", | |
" <td>fe</td>\n", | |
" <td>fem</td>\n", | |
" <td>femm</td>\n", | |
" <td>e</td>\n", | |
" <td>me</td>\n", | |
" <td>mme</td>\n", | |
" <td>emme</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>42</th>\n", | |
" <td>?</td>\n", | |
" <td>??</td>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" <td>?</td>\n", | |
" <td>??</td>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>34</th>\n", | |
" <td>?</td>\n", | |
" <td>??</td>\n", | |
" <td>??</td>\n", | |
" <td>?? ?</td>\n", | |
" <td>?</td>\n", | |
" <td>??</td>\n", | |
" <td>???</td>\n", | |
" <td>????</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>119</th>\n", | |
" <td>m</td>\n", | |
" <td>mr</td>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" <td>r</td>\n", | |
" <td>mr</td>\n", | |
" <td>None</td>\n", | |
" <td>None</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" f_1 f_2 f_3 f_4 l_1 l_2 l_3 l_4\n", | |
"122 v ve vel velg g lg elg velg\n", | |
"46 m ma mal male 1 1 e 1 le 1\n", | |
"49 b be bez bez a ra ora vora\n", | |
"100 e er err erre k ék nék lnék\n", | |
"59 w wa wan wani a ta ita nita\n", | |
"124 ž že žen žens ý ký ský nský\n", | |
"91 f fe fem femm e me mme emme\n", | |
"42 ? ?? None None ? ?? None None\n", | |
"34 ? ?? ?? ?? ? ? ?? ??? ????\n", | |
"119 m mr None None r mr None None" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"feature_df.sample(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Train Test Split " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"cut_point = int(len(feature_sets) * training_percent)\n", | |
"train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### NaiveBayes Classifier" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"nb_classifier = NaiveBayesClassifier.train(train_set)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Accuracy of NaiveBayesClassifier: 0.75 \n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Accuracy of NaiveBayesClassifier: {} \".format(classify.accuracy(nb_classifier, test_set)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Most Informative Features\n", | |
" f_1 = 'm' Male : Other/ = 20.9 : 1.0\n", | |
" f_1 = 'f' Female : Male = 6.5 : 1.0\n", | |
" l_1 = 'a' Female : Other/ = 6.3 : 1.0\n", | |
" f_1 = 'k' Female : Other/ = 3.9 : 1.0\n", | |
" f_1 = 'p' Other/ : Male = 3.5 : 1.0\n", | |
" f_2 = 'kv' Female : Other/ = 3.3 : 1.0\n", | |
" l_1 = 'k' Male : Other/ = 2.8 : 1.0\n", | |
" l_1 = '1' Male : Other/ = 2.8 : 1.0\n", | |
" l_1 = 'i' Male : Other/ = 2.8 : 1.0\n", | |
" f_1 = 'n' Other/ : Female = 2.8 : 1.0\n", | |
"None\n" | |
] | |
} | |
], | |
"source": [ | |
"print(nb_classifier.show_most_informative_features(10))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Maxent Classifier" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" ==> Training (100 iterations)\n", | |
"\n", | |
" Iteration Log Likelihood Accuracy\n", | |
" ---------------------------------------\n", | |
" 1 -1.09861 0.495\n", | |
" 2 -0.55492 0.954\n", | |
" 3 -0.39256 0.991\n", | |
" 4 -0.30684 1.000\n", | |
" 5 -0.25292 1.000\n", | |
" 6 -0.21564 1.000\n", | |
" 7 -0.18823 1.000\n", | |
" 8 -0.16719 1.000\n", | |
" 9 -0.15051 1.000\n", | |
" 10 -0.13694 1.000\n", | |
" 11 -0.12568 1.000\n", | |
" 92 -0.01723 1.000\n", | |
" 93 -0.01705 1.000\n", | |
" 94 -0.01688 1.000\n", | |
" 95 -0.01671 1.000\n", | |
" 96 -0.01654 1.000\n", | |
" 97 -0.01637 1.000\n", | |
" 98 -0.01621 1.000\n", | |
" 99 -0.01605 1.000\n", | |
" Final -0.01590 1.000\n" | |
] | |
} | |
], | |
"source": [ | |
"max_classifier = MaxentClassifier.train(train_set)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Accuracy of MaxentClassifier: 0.75 \n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Accuracy of MaxentClassifier: {} \".format(classify.accuracy(max_classifier, test_set)))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" -4.701 f_1=='m' and label is 'Other/Prefer Not To Answer'\n", | |
" 3.138 l_1=='f' and label is 'Female'\n", | |
" 3.132 l_1=='男' and label is 'Male'\n", | |
" 3.132 f_1=='男' and label is 'Male'\n", | |
" -2.761 f_1=='m' and label is 'Female'\n", | |
" 2.704 l_1=='女' and label is 'Female'\n", | |
" 2.704 f_1=='女' and label is 'Female'\n", | |
" 2.640 l_2=='nő' and label is 'Female'\n", | |
" 2.640 l_1=='ő' and label is 'Female'\n", | |
" 2.640 f_2=='nő' and label is 'Female'\n", | |
"None\n" | |
] | |
} | |
], | |
"source": [ | |
"print(max_classifier.show_most_informative_features(10))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### Decision Tree Classifier" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"decision_classifier = DecisionTreeClassifier.train(train_set)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Accuracy of DecisionTreeClassifier: 0.6428571428571429 \n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Accuracy of DecisionTreeClassifier: {} \".format(classify.accuracy(decision_classifier, test_set)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 4. Evaluation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Enter q (or) quit to end this test module\n", | |
"\n", | |
"Enter data for testing: M\n", | |
"{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'm', 'f_2': None, 'f_1': 'm', 'l_3': None}\n", | |
"NaiveBayes Classifier : Male\n", | |
"Maxent Classifier : Male\n", | |
"Decision Tree Classifier : Male\n", | |
"---------------------------------------------------------------------------\n", | |
"(Best of 3) = Male\n", | |
"\n", | |
"Enter data for testing: F\n", | |
"{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'f', 'f_2': None, 'f_1': 'f', 'l_3': None}\n", | |
"NaiveBayes Classifier : Female\n", | |
"Maxent Classifier : Female\n", | |
"Decision Tree Classifier : Female\n", | |
"---------------------------------------------------------------------------\n", | |
"(Best of 3) = Female\n", | |
"\n", | |
"Enter data for testing: female\n", | |
"{'f_3': 'fem', 'l_4': 'male', 'f_4': 'fema', 'l_2': 'le', 'l_1': 'e', 'f_2': 'fe', 'f_1': 'f', 'l_3': 'ale'}\n", | |
"NaiveBayes Classifier : Female\n", | |
"Maxent Classifier : Female\n", | |
"Decision Tree Classifier : Female\n", | |
"---------------------------------------------------------------------------\n", | |
"(Best of 3) = Female\n", | |
"\n", | |
"Enter data for testing: MMMMM\n", | |
"{'f_3': 'mmm', 'l_4': 'mmmm', 'f_4': 'mmmm', 'l_2': 'mm', 'l_1': 'm', 'f_2': 'mm', 'f_1': 'm', 'l_3': 'mmm'}\n", | |
"NaiveBayes Classifier : Male\n", | |
"Maxent Classifier : Male\n", | |
"Decision Tree Classifier : Female\n", | |
"---------------------------------------------------------------------------\n", | |
"(Best of 3) = Male\n", | |
"\n", | |
"Enter data for testing: FFFFF\n", | |
"{'f_3': 'fff', 'l_4': 'ffff', 'f_4': 'ffff', 'l_2': 'ff', 'l_1': 'f', 'f_2': 'ff', 'f_1': 'f', 'l_3': 'fff'}\n", | |
"NaiveBayes Classifier : Female\n", | |
"Maxent Classifier : Female\n", | |
"Decision Tree Classifier : Female\n", | |
"---------------------------------------------------------------------------\n", | |
"(Best of 3) = Female\n" | |
] | |
} | |
], | |
"source": [ | |
"print('Enter q (or) quit to end this test module')\n", | |
"while 1:\n", | |
" data = input('\\nEnter data for testing: ')\n", | |
" if data.lower() == 'q' or data.lower() == 'quit':\n", | |
" print('End')\n", | |
" break\n", | |
"\n", | |
" if not len(data):\n", | |
" continue\n", | |
"\n", | |
" features = feature_extraction(data)\n", | |
" print(features)\n", | |
" prediction = [nb_classifier.classify(features),\n", | |
" max_classifier.classify(features),\n", | |
" decision_classifier.classify(features)]\n", | |
"\n", | |
" print('NaiveBayes Classifier : ', prediction[0])\n", | |
" print('Maxent Classifier : ', prediction[1])\n", | |
" print('Decision Tree Classifier : ', prediction[2])\n", | |
" print('-'*75)\n", | |
" print('(Best of 3) = ', max(set(prediction), key=prediction.count))" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment