Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-2/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data wrangling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data wrangling, sometimes referred to as **data munging**, is the process of transforming and mapping data from one \"raw\" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)\n",
"\n",
"Wrangler is an interactive tool for data cleaning and transformation.\n",
"Spend less time formatting and more time analyzing your data. [stanford](http://vis.stanford.edu/wrangler/)"
]
},
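{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of that idea (the raw values below are made up, not taken from the project data), wrangling typically means normalising messy raw entries and mapping them onto a small set of clean categories, e.g. with pandas; the rest of this notebook automates exactly that kind of mapping with a classifier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal raw -> clean mapping sketch (hypothetical values, for illustration only).\n",
"import pandas as pd\n",
"\n",
"raw = pd.DataFrame({'gender': ['F', 'female ', 'MALE', 'm', '??']})\n",
"mapping = {'f': 'Female', 'female': 'Female', 'm': 'Male', 'male': 'Male'}\n",
"# Normalise case and whitespace, then map; unknown values get a catch-all label.\n",
"raw['clean'] = (raw['gender'].str.strip().str.lower()\n",
"                .map(mapping)\n",
"                .fillna('Other/Prefer Not To Answer'))\n",
"raw"
]
},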
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ### Example - 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 0 - Requirement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I was given a data problem where I have to write a model to auto-clean database values without manual work. This was my first practical ML solution delivered to my client. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Analysis"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"#!/usr/bin/env python3.5\n",
"# encoding: utf-8\n",
"\n",
"import random\n",
"import csv\n",
"from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier\n",
"\n",
"gender_file = 'gender.csv'\n",
"training_percent = 0.8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analysing the dataset before processing. I was given a column of actual values their corresponding correction values. I have planned to use the same solution similar to name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"gender_df = pd.read_csv(gender_file, header=None, usecols=[1,2])\n",
"gender_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(137, 2)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>137</td>\n",
" <td>137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>137</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>???? ??? ???????</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>1</td>\n",
" <td>73</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"count 137 137\n",
"unique 137 3\n",
"top ???? ??? ??????? Other/Prefer Not To Answer\n",
"freq 1 73"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Female</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>F</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Female 1</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Female male</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Female1</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"0 Female Female\n",
"1 F Female\n",
"2 Female 1 Female\n",
"3 Female male Female\n",
"4 Female1 Female"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>132</th>\n",
" <td>Wole nie podawac</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>133</th>\n",
" <td>Επιλέξτε</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>134</th>\n",
" <td>선택</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>135</th>\n",
" <td>选择</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>136</th>\n",
" <td>選擇</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"132 Wole nie podawac Other/Prefer Not To Answer\n",
"133 Επιλέξτε Other/Prefer Not To Answer\n",
"134 선택 Other/Prefer Not To Answer\n",
"135 选择 Other/Prefer Not To Answer\n",
"136 選擇 Other/Prefer Not To Answer"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_df.tail()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>Mezczyzna</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Ms</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>125</th>\n",
" <td>test1</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>Karlkyns</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>?????</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>女</td>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>Nespecificat</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>???????</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>113</th>\n",
" <td>Prefer Not To Answer</td>\n",
" <td>Other/Prefer Not To Answer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>Άνδρας</td>\n",
" <td>Male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"53 Mezczyzna Male\n",
"20 Ms Female\n",
"125 test1 Other/Prefer Not To Answer\n",
"94 Karlkyns Other/Prefer Not To Answer\n",
"74 ????? Other/Prefer Not To Answer\n",
"33 女 Female\n",
"99 Nespecificat Other/Prefer Not To Answer\n",
"75 ??????? Other/Prefer Not To Answer\n",
"113 Prefer Not To Answer Other/Prefer Not To Answer\n",
"61 Άνδρας Male"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_df.sample(10)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Female', 'Male', 'Other/Prefer Not To Answer'], dtype=object)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_df['correction'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Solution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Making feature matrix X "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def feature_extraction(_data):\n",
" \"\"\" This function is used to extract features in a given data value\"\"\"\n",
" _data = _data.lower()\n",
" f_1, f_2, f_3, f_4, l_1, l_2, l_3, l_4 = None, None, None, None, None, None, None ,None\n",
" \n",
" # extracting first and last 4 characters\n",
" if len(_data) >= 4:\n",
" f_4 = _data[:4]\n",
" l_4 = _data[-4:]\n",
" # extracting first and last 3 characters\n",
" if len(_data) >= 3:\n",
" f_3 = _data[:3]\n",
" l_3 = _data[-3:]\n",
" # extracting first and last 2 characters\n",
" if len(_data) >= 2:\n",
" f_2 = _data[:2]\n",
" l_2 = _data[-2:]\n",
" # extracting first and last 1 character\n",
" if len(_data) >= 1:\n",
" f_1 = _data[:1]\n",
" l_1 = _data[-1:]\n",
" \n",
" feature = {\n",
" 'f_1': f_1,\n",
" 'f_2': f_2,\n",
" 'l_1': l_1,\n",
" 'l_2': l_2,\n",
" 'f_3': f_3,\n",
" 'f_4': f_4,\n",
" 'l_3': l_3,\n",
" 'l_4': l_4\n",
" }\n",
"\n",
" return feature"
]
},
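{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the extractor on a single value (this cell is an illustrative addition, not part of the original run): the features are simply the first and last 1-4 characters of the lower-cased string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: 'Female' -> {'f_1': 'f', 'f_2': 'fe', 'f_3': 'fem', 'f_4': 'fema',\n",
"#                       'l_1': 'e', 'l_2': 'le', 'l_3': 'ale', 'l_4': 'male'}\n",
"feature_extraction('Female')"
]
},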
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Loading dataset"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"dataset = []\n",
"with open(gender_file, newline='\\n') as fp:\n",
" input_data = csv.reader(fp, delimiter=',')\n",
" for row in input_data:\n",
" dataset.append((row[1:]))\n",
"feature_sets = [(actual, correction) for (actual, correction) in dataset]\n",
"random.shuffle(feature_sets)"
]
},
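{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the shuffle above is unseeded, so the train/test split (and the accuracies reported later) vary from run to run. Seeding the random generator before shuffling is an optional tweak, not used in the original run, that makes the results reproducible:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional, for reproducible splits: seed before shuffling.\n",
"# Left commented out so the unseeded run above is not altered.\n",
"# random.seed(42)\n",
"# random.shuffle(feature_sets)"
]
},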
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### creating feature matrix X and response vector y"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Visualizing Feature Matrix X"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"feature_val = [val[0] for val in feature_sets]\n",
"feature_df = pd.DataFrame(feature_val)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(137, 8)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>f_1</th>\n",
" <th>f_2</th>\n",
" <th>f_3</th>\n",
" <th>f_4</th>\n",
" <th>l_1</th>\n",
" <th>l_2</th>\n",
" <th>l_3</th>\n",
" <th>l_4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>122</th>\n",
" <td>v</td>\n",
" <td>ve</td>\n",
" <td>vel</td>\n",
" <td>velg</td>\n",
" <td>g</td>\n",
" <td>lg</td>\n",
" <td>elg</td>\n",
" <td>velg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>m</td>\n",
" <td>ma</td>\n",
" <td>mal</td>\n",
" <td>male</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>e 1</td>\n",
" <td>le 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>b</td>\n",
" <td>be</td>\n",
" <td>bez</td>\n",
" <td>bez</td>\n",
" <td>a</td>\n",
" <td>ra</td>\n",
" <td>ora</td>\n",
" <td>vora</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>e</td>\n",
" <td>er</td>\n",
" <td>err</td>\n",
" <td>erre</td>\n",
" <td>k</td>\n",
" <td>ék</td>\n",
" <td>nék</td>\n",
" <td>lnék</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>w</td>\n",
" <td>wa</td>\n",
" <td>wan</td>\n",
" <td>wani</td>\n",
" <td>a</td>\n",
" <td>ta</td>\n",
" <td>ita</td>\n",
" <td>nita</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124</th>\n",
" <td>ž</td>\n",
" <td>že</td>\n",
" <td>žen</td>\n",
" <td>žens</td>\n",
" <td>ý</td>\n",
" <td>ký</td>\n",
" <td>ský</td>\n",
" <td>nský</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>f</td>\n",
" <td>fe</td>\n",
" <td>fem</td>\n",
" <td>femm</td>\n",
" <td>e</td>\n",
" <td>me</td>\n",
" <td>mme</td>\n",
" <td>emme</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>?</td>\n",
" <td>??</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>?</td>\n",
" <td>??</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>?</td>\n",
" <td>??</td>\n",
" <td>??</td>\n",
" <td>?? ?</td>\n",
" <td>?</td>\n",
" <td>??</td>\n",
" <td>???</td>\n",
" <td>????</td>\n",
" </tr>\n",
" <tr>\n",
" <th>119</th>\n",
" <td>m</td>\n",
" <td>mr</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>r</td>\n",
" <td>mr</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" f_1 f_2 f_3 f_4 l_1 l_2 l_3 l_4\n",
"122 v ve vel velg g lg elg velg\n",
"46 m ma mal male 1 1 e 1 le 1\n",
"49 b be bez bez a ra ora vora\n",
"100 e er err erre k ék nék lnék\n",
"59 w wa wan wani a ta ita nita\n",
"124 ž že žen žens ý ký ský nský\n",
"91 f fe fem femm e me mme emme\n",
"42 ? ?? None None ? ?? None None\n",
"34 ? ?? ?? ?? ? ? ?? ??? ????\n",
"119 m mr None None r mr None None"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_df.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Train Test Split "
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"cut_point = int(len(feature_sets) * training_percent)\n",
"train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### NaiveBayes Classifier"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"nb_classifier = NaiveBayesClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy of NaiveBayesClassifier: 0.75 \n"
]
}
],
"source": [
"print(\"Accuracy of NaiveBayesClassifier: {} \".format(classify.accuracy(nb_classifier, test_set)))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most Informative Features\n",
" f_1 = 'm' Male : Other/ = 20.9 : 1.0\n",
" f_1 = 'f' Female : Male = 6.5 : 1.0\n",
" l_1 = 'a' Female : Other/ = 6.3 : 1.0\n",
" f_1 = 'k' Female : Other/ = 3.9 : 1.0\n",
" f_1 = 'p' Other/ : Male = 3.5 : 1.0\n",
" f_2 = 'kv' Female : Other/ = 3.3 : 1.0\n",
" l_1 = 'k' Male : Other/ = 2.8 : 1.0\n",
" l_1 = '1' Male : Other/ = 2.8 : 1.0\n",
" l_1 = 'i' Male : Other/ = 2.8 : 1.0\n",
" f_1 = 'n' Other/ : Female = 2.8 : 1.0\n",
"None\n"
]
}
],
"source": [
"print(nb_classifier.show_most_informative_features(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Maxent Classifier"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ==> Training (100 iterations)\n",
"\n",
" Iteration Log Likelihood Accuracy\n",
" ---------------------------------------\n",
" 1 -1.09861 0.495\n",
" 2 -0.55492 0.954\n",
" 3 -0.39256 0.991\n",
" 4 -0.30684 1.000\n",
" 5 -0.25292 1.000\n",
" 6 -0.21564 1.000\n",
" 7 -0.18823 1.000\n",
" 8 -0.16719 1.000\n",
" 9 -0.15051 1.000\n",
" 10 -0.13694 1.000\n",
" 11 -0.12568 1.000\n",
" 92 -0.01723 1.000\n",
" 93 -0.01705 1.000\n",
" 94 -0.01688 1.000\n",
" 95 -0.01671 1.000\n",
" 96 -0.01654 1.000\n",
" 97 -0.01637 1.000\n",
" 98 -0.01621 1.000\n",
" 99 -0.01605 1.000\n",
" Final -0.01590 1.000\n"
]
}
],
"source": [
"max_classifier = MaxentClassifier.train(train_set)"
]
},
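{
"cell_type": "markdown",
"metadata": {},
"source": [
"The maxent trainer runs for 100 iterations by default and prints one log line per iteration. If a shorter or quieter run is wanted, NLTK's `MaxentClassifier.train` accepts `max_iter` and `trace` keyword arguments; the values below are only an illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: a shorter, silent training run (illustrative values).\n",
"# Left commented out so the classifier trained above is not overwritten.\n",
"# max_classifier = MaxentClassifier.train(train_set, trace=0, max_iter=25)"
]
},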
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy of MaxentClassifier: 0.75 \n"
]
}
],
"source": [
"print(\"Accuracy of MaxentClassifier: {} \".format(classify.accuracy(max_classifier, test_set)))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" -4.701 f_1=='m' and label is 'Other/Prefer Not To Answer'\n",
" 3.138 l_1=='f' and label is 'Female'\n",
" 3.132 l_1=='男' and label is 'Male'\n",
" 3.132 f_1=='男' and label is 'Male'\n",
" -2.761 f_1=='m' and label is 'Female'\n",
" 2.704 l_1=='女' and label is 'Female'\n",
" 2.704 f_1=='女' and label is 'Female'\n",
" 2.640 l_2=='nő' and label is 'Female'\n",
" 2.640 l_1=='ő' and label is 'Female'\n",
" 2.640 f_2=='nő' and label is 'Female'\n",
"None\n"
]
}
],
"source": [
"print(max_classifier.show_most_informative_features(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Decision Tree Classifier"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"decision_classifier = DecisionTreeClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy of DecisionTreeClassifier: 0.6428571428571429 \n"
]
}
],
"source": [
"print(\"Accuracy of DecisionTreeClassifier: {} \".format(classify.accuracy(decision_classifier, test_set)))"
]
},
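{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Applying the classifiers to the whole column\n",
"\n",
"To tie this back to the original requirement of auto-cleaning the database values, the sketch below (an illustrative addition, not part of the original notebook) runs every raw value in the `actual` column through the three trained classifiers and takes a best-of-3 majority vote. Note that the column includes the rows the classifiers were trained on, so the agreement score is optimistic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_value(value):\n",
"    \"\"\"Predict the corrected label for a raw value using a best-of-3 vote.\"\"\"\n",
"    features = feature_extraction(value)\n",
"    votes = [nb_classifier.classify(features),\n",
"             max_classifier.classify(features),\n",
"             decision_classifier.classify(features)]\n",
"    return max(set(votes), key=votes.count)\n",
"\n",
"# Apply the vote to every raw value and compare against the known corrections.\n",
"gender_df['predicted'] = gender_df['actual'].apply(clean_value)\n",
"(gender_df['predicted'] == gender_df['correction']).mean()"
]
},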
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Enter q (or) quit to end this test module\n",
"\n",
"Enter data for testing: M\n",
"{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'm', 'f_2': None, 'f_1': 'm', 'l_3': None}\n",
"NaiveBayes Classifier : Male\n",
"Maxent Classifier : Male\n",
"Decision Tree Classifier : Male\n",
"---------------------------------------------------------------------------\n",
"(Best of 3) = Male\n",
"\n",
"Enter data for testing: F\n",
"{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'f', 'f_2': None, 'f_1': 'f', 'l_3': None}\n",
"NaiveBayes Classifier : Female\n",
"Maxent Classifier : Female\n",
"Decision Tree Classifier : Female\n",
"---------------------------------------------------------------------------\n",
"(Best of 3) = Female\n",
"\n",
"Enter data for testing: female\n",
"{'f_3': 'fem', 'l_4': 'male', 'f_4': 'fema', 'l_2': 'le', 'l_1': 'e', 'f_2': 'fe', 'f_1': 'f', 'l_3': 'ale'}\n",
"NaiveBayes Classifier : Female\n",
"Maxent Classifier : Female\n",
"Decision Tree Classifier : Female\n",
"---------------------------------------------------------------------------\n",
"(Best of 3) = Female\n",
"\n",
"Enter data for testing: MMMMM\n",
"{'f_3': 'mmm', 'l_4': 'mmmm', 'f_4': 'mmmm', 'l_2': 'mm', 'l_1': 'm', 'f_2': 'mm', 'f_1': 'm', 'l_3': 'mmm'}\n",
"NaiveBayes Classifier : Male\n",
"Maxent Classifier : Male\n",
"Decision Tree Classifier : Female\n",
"---------------------------------------------------------------------------\n",
"(Best of 3) = Male\n",
"\n",
"Enter data for testing: FFFFF\n",
"{'f_3': 'fff', 'l_4': 'ffff', 'f_4': 'ffff', 'l_2': 'ff', 'l_1': 'f', 'f_2': 'ff', 'f_1': 'f', 'l_3': 'fff'}\n",
"NaiveBayes Classifier : Female\n",
"Maxent Classifier : Female\n",
"Decision Tree Classifier : Female\n",
"---------------------------------------------------------------------------\n",
"(Best of 3) = Female\n"
]
}
],
"source": [
"print('Enter q (or) quit to end this test module')\n",
"while 1:\n",
" data = input('\\nEnter data for testing: ')\n",
" if data.lower() == 'q' or data.lower() == 'quit':\n",
" print('End')\n",
" break\n",
"\n",
" if not len(data):\n",
" continue\n",
"\n",
" features = feature_extraction(data)\n",
" print(features)\n",
" prediction = [nb_classifier.classify(features),\n",
" max_classifier.classify(features),\n",
" decision_classifier.classify(features)]\n",
"\n",
" print('NaiveBayes Classifier : ', prediction[0])\n",
" print('Maxent Classifier : ', prediction[1])\n",
" print('Decision Tree Classifier : ', prediction[2])\n",
" print('-'*75)\n",
" print('(Best of 3) = ', max(set(prediction), key=prediction.count))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}