Skip to content

Instantly share code, notes, and snippets.

@vijayanandrp
Created December 6, 2017 00:33
Show Gist options
  • Save vijayanandrp/e31bebdc26c66b905682c5b2cc38ada0 to your computer and use it in GitHub Desktop.
Save vijayanandrp/e31bebdc26c66b905682c5b2cc38ada0 to your computer and use it in GitHub Desktop.
Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-1/
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data wrangling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data wrangling, sometimes referred to as **data munging**, is the process of transforming and mapping data from one \"raw\" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)\n",
"\n",
"Wrangler is an interactive tool for data cleaning and transformation.\n",
"Spend less time formatting and more time analyzing your data. [stanford](http://vis.stanford.edu/wrangler/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ### Example - 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 0 - Requirement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I was given a data problem where I have to write a model to auto-clean database values without manual work. This was my first practical ML solution delivered to my client. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Analysis"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"#!/usr/bin/env python3.5\n",
"# encoding: utf-8\n",
"\n",
"import random\n",
"import csv\n",
"from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier\n",
"\n",
"age_file = 'age.csv'\n",
"training_percent = 0.8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analysing the dataset before processing. I was given a column of actual values their corresponding correction values. I have planned to use the same solution similar to name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"age_df = pd.read_csv(age_file, header=None, usecols=[1,2])\n",
"age_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(480, 2)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>480</td>\n",
" <td>480</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>480</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>36years old</td>\n",
" <td>30 to 34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>1</td>\n",
" <td>51</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"count 480 480\n",
"unique 480 14\n",
"top 36years old 30 to 34\n",
"freq 1 51"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>18 to 20</td>\n",
" <td>18 to 20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>18</td>\n",
" <td>18 to 20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>18 - 20</td>\n",
" <td>18 to 20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>18 - 21</td>\n",
" <td>18 to 20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>18 - 22</td>\n",
" <td>18 to 20</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"0 18 to 20 18 to 20\n",
"1 18 18 to 20\n",
"2 18 - 20 18 to 20\n",
"3 18 - 21 18 to 20\n",
"4 18 - 22 18 to 20"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>475</th>\n",
" <td>?? ??</td>\n",
" <td>?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>476</th>\n",
" <td>?? ??? ????</td>\n",
" <td>?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>477</th>\n",
" <td>???? ??? ????</td>\n",
" <td>?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>478</th>\n",
" <td>SHL Bureau 7</td>\n",
" <td>?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>479</th>\n",
" <td>SMT6</td>\n",
" <td>?</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"475 ?? ?? ?\n",
"476 ?? ??? ???? ?\n",
"477 ???? ??? ???? ?\n",
"478 SHL Bureau 7 ?\n",
"479 SMT6 ?"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df.tail()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>actual</th>\n",
" <th>correction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>435</th>\n",
" <td>78</td>\n",
" <td>65+</td>\n",
" </tr>\n",
" <tr>\n",
" <th>284</th>\n",
" <td>47 years</td>\n",
" <td>45 to 49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>261</th>\n",
" <td>45</td>\n",
" <td>45 to 49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>22 years</td>\n",
" <td>21 to 24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140</th>\n",
" <td>31</td>\n",
" <td>30 to 34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>109</th>\n",
" <td>28years old</td>\n",
" <td>25 to 29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>461</th>\n",
" <td>Jonger dan 18</td>\n",
" <td>Under 18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>129</th>\n",
" <td>30 a 34</td>\n",
" <td>30 to 34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>187</th>\n",
" <td>36 - 40</td>\n",
" <td>35 to 39</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>28Years</td>\n",
" <td>25 to 29</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" actual correction\n",
"435 78 65+\n",
"284 47 years 45 to 49\n",
"261 45 45 to 49\n",
"46 22 years 21 to 24\n",
"140 31 30 to 34\n",
"109 28years old 25 to 29\n",
"461 Jonger dan 18 Under 18\n",
"129 30 a 34 30 to 34\n",
"187 36 - 40 35 to 39\n",
"108 28Years 25 to 29"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df.sample(10)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['18 to 20', '21 to 24', '25 to 29', '30 to 34', '35 to 39',\n",
" '40 to 44', '45 to 49', '50 to 54', '55 to 59', '60 to 64', '65+',\n",
" 'Declined to Respond', 'Under 18', '?'], dtype=object)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age_df['correction'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Solution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Making feature matrix X "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def feature_extraction(_data):\n",
" \"\"\" This function is used to extract features in a given data value\"\"\"\n",
" # Find the digits in the given string Example - data='18-20' digits = '1820'\n",
" digits = str(''.join(c for c in _data if c.isdigit()))\n",
" # calculate the length of the string\n",
" len_digits = len(digits)\n",
" # splitting digits in to values example - digits = '1820' ages = [18, 20]\n",
" ages = [int(digits[i:i + 2]) for i in range(0, len_digits, 2)]\n",
" # checking for special character in the given data\n",
" special_character = '.+-<>?'\n",
" spl_char = ''.join([c for c in list(special_character) if c in _data])\n",
" # handling decimal age data\n",
" if len_digits == 3:\n",
" spl_char = '.'\n",
" age = \"\".join([str(ages[0]), '.', str(ages[1])])\n",
" # normalizing\n",
" age = int(float(age) - 0.5)\n",
" ages = [age]\n",
" # Finding the maximum, minimum, average age values\n",
" max_age = 0\n",
" min_age = 0\n",
" mean_age = 0\n",
" if len(ages):\n",
" max_age = max(ages)\n",
" min_age = min(ages)\n",
" if len(ages) == 2:\n",
" mean_age = int((max_age + min_age) / 2)\n",
" else:\n",
" mean_age = max_age\n",
" # specially added for 18 years cases\n",
" only_18 = 0\n",
" is_y = 0\n",
" if ages == [18]:\n",
" only_18 = 1\n",
" if 'y' in _data or 'Y' in _data:\n",
" is_y = 1\n",
" under_18 = 0\n",
" if 1 < max_age < 18:\n",
" under_18 = 1\n",
" above_65 = 0\n",
" if mean_age >= 65:\n",
" above_65 = 1\n",
" # verifying whether digit is found in the given string or not.\n",
" # Example - data='18-20' digits_found=True data='????' digits_found=False\n",
" digits_found = 1\n",
" if len_digits == 1:\n",
" digits_found = 1\n",
" max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = 0, 0, 0, 0, 0, 0, 0\n",
" elif len_digits == 0:\n",
" digits_found, max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = -1, -1, -1, -1, -1, -1, -1, -1\n",
" \n",
" feature = {\n",
" 'ages': tuple(ages),\n",
" 'len(ages)': len(ages),\n",
" 'spl_chr': spl_char,\n",
" 'is_digit': digits_found,\n",
" 'max_age': max_age,\n",
" 'mean_age': mean_age,\n",
" 'only_18': only_18,\n",
" 'is_y': is_y,\n",
" 'above_65': above_65,\n",
" 'under_18': under_18\n",
" }\n",
"\n",
" return feature"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Loading dataset"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"dataset = []\n",
"with open(age_file, newline='\\n') as fp:\n",
" input_data = csv.reader(fp, delimiter=',')\n",
" for row in input_data:\n",
" dataset.append((row[1:]))\n",
"feature_sets = [(actual, correction) for (actual, correction) in dataset]\n",
"random.shuffle(feature_sets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### creating feature matrix X and response vector y"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Visualizing Feature Matrix X"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"feature_val = [val[0] for val in feature_sets]\n",
"feature_df = pd.DataFrame(feature_val)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(480, 10)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>above_65</th>\n",
" <th>ages</th>\n",
" <th>is_digit</th>\n",
" <th>is_y</th>\n",
" <th>len(ages)</th>\n",
" <th>max_age</th>\n",
" <th>mean_age</th>\n",
" <th>only_18</th>\n",
" <th>spl_chr</th>\n",
" <th>under_18</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>432</th>\n",
" <td>0</td>\n",
" <td>(63,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>63</td>\n",
" <td>63</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>154</th>\n",
" <td>0</td>\n",
" <td>(32,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>32</td>\n",
" <td>32</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>430</th>\n",
" <td>0</td>\n",
" <td>(58,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>58</td>\n",
" <td>58</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>407</th>\n",
" <td>0</td>\n",
" <td>(34,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>34</td>\n",
" <td>34</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>0</td>\n",
" <td>(34,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>34</td>\n",
" <td>34</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>462</th>\n",
" <td>1</td>\n",
" <td>(65,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>65</td>\n",
" <td>65</td>\n",
" <td>0</td>\n",
" <td>&gt;</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>1</td>\n",
" <td>(65,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>65</td>\n",
" <td>65</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>356</th>\n",
" <td>-1</td>\n",
" <td>()</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>0</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>?</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>180</th>\n",
" <td>0</td>\n",
" <td>(20, 24)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>24</td>\n",
" <td>22</td>\n",
" <td>0</td>\n",
" <td>-</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>451</th>\n",
" <td>1</td>\n",
" <td>(66,)</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>66</td>\n",
" <td>66</td>\n",
" <td>0</td>\n",
" <td></td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" above_65 ages is_digit is_y len(ages) max_age mean_age \\\n",
"432 0 (63,) 1 0 1 63 63 \n",
"154 0 (32,) 1 0 1 32 32 \n",
"430 0 (58,) 1 0 1 58 58 \n",
"407 0 (34,) 1 0 1 34 34 \n",
"65 0 (34,) 1 0 1 34 34 \n",
"462 1 (65,) 1 0 1 65 65 \n",
"97 1 (65,) 1 0 1 65 65 \n",
"356 -1 () -1 -1 0 -1 -1 \n",
"180 0 (20, 24) 1 0 2 24 22 \n",
"451 1 (66,) 1 0 1 66 66 \n",
"\n",
" only_18 spl_chr under_18 \n",
"432 0 0 \n",
"154 0 0 \n",
"430 0 0 \n",
"407 0 0 \n",
"65 0 0 \n",
"462 0 > 0 \n",
"97 0 0 \n",
"356 -1 ? -1 \n",
"180 0 - 0 \n",
"451 0 0 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_df.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Train Test Split "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"cut_point = int(len(feature_sets) * training_percent)\n",
"train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### NaiveBayes Classifier"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"nb_classifier = NaiveBayesClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy of NaiveBayesClassifier: 0.9583333333333334 \n"
]
}
],
"source": [
"print(\"Accuracy of NaiveBayesClassifier: {} \".format(classify.accuracy(nb_classifier, test_set)))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most Informative Features\n",
" max_age = 65 65+ : 60 to = 10.4 : 1.0\n",
" max_age = 59 55 to : 50 to = 7.2 : 1.0\n",
" len(ages) = 2 60 to : Under = 6.3 : 1.0\n",
" spl_chr = '' 65+ : ? = 5.8 : 1.0\n",
" max_age = 39 35 to : 30 to = 5.7 : 1.0\n",
" only_18 = 0 30 to : ? = 4.9 : 1.0\n",
" under_18 = 0 30 to : ? = 4.9 : 1.0\n",
" is_y = 0 30 to : ? = 4.9 : 1.0\n",
" above_65 = 0 30 to : ? = 4.9 : 1.0\n",
" len(ages) = 1 65+ : ? = 4.9 : 1.0\n",
"None\n"
]
}
],
"source": [
"print(nb_classifier.show_most_informative_features(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Maxent Classifier"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" ==> Training (100 iterations)\n",
"\n",
" Iteration Log Likelihood Accuracy\n",
" ---------------------------------------\n",
" 1 -2.63906 0.055\n",
" 2 -1.71509 0.930\n",
" 3 -1.30567 0.964\n",
" 4 -1.03310 0.964\n",
" 5 -0.84445 0.966\n",
" 6 -0.70872 0.992\n",
" 7 -0.60766 0.992\n",
" 8 -0.53016 0.992\n",
" 9 -0.46919 0.992\n",
" 10 -0.42015 0.992\n",
" 11 -0.37998 0.992\n",
" 12 -0.34653 0.992\n",
" 13 -0.31829 0.992\n",
" 14 -0.29417 1.000\n",
" 15 -0.27333 1.000\n",
" 16 -0.25517 1.000\n",
" 17 -0.23921 1.000\n",
" 18 -0.22508 1.000\n",
" 19 -0.21249 1.000\n",
" 20 -0.20120 1.000\n",
" 21 -0.19103 1.000\n",
" 22 -0.18182 1.000\n",
" 23 -0.17343 1.000\n",
" 24 -0.16577 1.000\n",
" 25 -0.15875 1.000\n",
" 26 -0.15229 1.000\n",
" 27 -0.14633 1.000\n",
" 28 -0.14080 1.000\n",
" 29 -0.13568 1.000\n",
" 30 -0.13091 1.000\n",
" 31 -0.12646 1.000\n",
" 32 -0.12229 1.000\n",
" 33 -0.11839 1.000\n",
" 34 -0.11473 1.000\n",
" 35 -0.11128 1.000\n",
" 36 -0.10804 1.000\n",
" 37 -0.10497 1.000\n",
" 38 -0.10208 1.000\n",
" 39 -0.09933 1.000\n",
" 40 -0.09673 1.000\n",
" 41 -0.09426 1.000\n",
" 42 -0.09191 1.000\n",
" 43 -0.08968 1.000\n",
" 44 -0.08755 1.000\n",
" 45 -0.08552 1.000\n",
" 46 -0.08358 1.000\n",
" 47 -0.08172 1.000\n",
" 48 -0.07995 1.000\n",
" 49 -0.07825 1.000\n",
" 50 -0.07662 1.000\n",
" 51 -0.07506 1.000\n",
" 52 -0.07356 1.000\n",
" 53 -0.07211 1.000\n",
" 54 -0.07073 1.000\n",
" 55 -0.06939 1.000\n",
" 56 -0.06810 1.000\n",
" 57 -0.06686 1.000\n",
" 58 -0.06567 1.000\n",
" 59 -0.06451 1.000\n",
" 60 -0.06340 1.000\n",
" 61 -0.06232 1.000\n",
" 62 -0.06128 1.000\n",
" 63 -0.06028 1.000\n",
" 64 -0.05930 1.000\n",
" 65 -0.05836 1.000\n",
" 66 -0.05744 1.000\n",
" 67 -0.05656 1.000\n",
" 68 -0.05570 1.000\n",
" 69 -0.05487 1.000\n",
" 70 -0.05406 1.000\n",
" 71 -0.05327 1.000\n",
" 72 -0.05251 1.000\n",
" 73 -0.05177 1.000\n",
" 74 -0.05105 1.000\n",
" 75 -0.05034 1.000\n",
" 76 -0.04966 1.000\n",
" 77 -0.04900 1.000\n",
" 78 -0.04835 1.000\n",
" 79 -0.04772 1.000\n",
" 80 -0.04711 1.000\n",
" 81 -0.04651 1.000\n",
" 82 -0.04593 1.000\n",
" 83 -0.04536 1.000\n",
" 84 -0.04480 1.000\n",
" 85 -0.04426 1.000\n",
" 86 -0.04373 1.000\n",
" 87 -0.04322 1.000\n",
" 88 -0.04271 1.000\n",
" 89 -0.04222 1.000\n",
" 90 -0.04174 1.000\n",
" 91 -0.04127 1.000\n",
" 92 -0.04081 1.000\n",
" 93 -0.04036 1.000\n",
" 94 -0.03992 1.000\n",
" 95 -0.03949 1.000\n",
" 96 -0.03907 1.000\n",
" 97 -0.03865 1.000\n",
" 98 -0.03825 1.000\n",
" 99 -0.03785 1.000\n",
" Final -0.03747 1.000\n"
]
}
],
"source": [
"max_classifier = MaxentClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy of MaxentClassifier: 0.9791666666666666 \n"
]
}
],
"source": [
"print(\"Accuracy of MaxentClassifier: {} \".format(classify.accuracy(max_classifier, test_set)))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 6.505 is_y==1 and label is '18 to 20'\n",
" -6.145 spl_chr=='' and label is '?'\n",
" 4.921 ages==(7,) and label is '?'\n",
" 4.921 mean_age==0 and label is '?'\n",
" 4.921 max_age==0 and label is '?'\n",
" 4.000 ages==(61, 65) and label is '60 to 64'\n",
" 4.000 mean_age==63 and label is '60 to 64'\n",
" 3.910 ages==(18, 21) and label is '18 to 20'\n",
" 3.883 ages==(50, 59) and label is '50 to 54'\n",
" 3.845 ages==(56, 60) and label is '55 to 59'\n",
"None\n"
]
}
],
"source": [
"print(max_classifier.show_most_informative_features(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Decision Tree Classifier"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"decision_classifier = DecisionTreeClassifier.train(train_set)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy of DecisionTreeClassifier: 0.9166666666666666 \n"
]
}
],
"source": [
"print(\"Accuracy of DecisionTreeClassifier: {} \".format(classify.accuracy(decision_classifier, test_set)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4. Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Enter q (or) quit to end this test module\n",
"\n",
"Enter data for testing: q\n",
"End\n"
]
}
],
"source": [
"print('Enter q (or) quit to end this test module')\n",
"while 1:\n",
" data = input('\\nEnter data for testing: ')\n",
" if data.lower() == 'q' or data.lower() == 'quit':\n",
" print('End')\n",
" break\n",
"\n",
" if not len(data):\n",
" continue\n",
"\n",
" features = feature_extraction(data)\n",
" print(features)\n",
" prediction = [nb_classifier.classify(features),\n",
" max_classifier.classify(features),\n",
" decision_classifier.classify(features)]\n",
"\n",
" print('NaiveBayes Classifier : ', prediction[0])\n",
" print('Maxent Classifier : ', prediction[1])\n",
" print('Decision Tree Classifier : ', prediction[2])\n",
" print('-'*75)\n",
" print('(Best of 3) = ', max(set(prediction), key=prediction.count))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment