Created
October 14, 2015 10:10
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# Text Classification Features and NLTK Classification Code #\nThis example is based on the NLTK book and uses the Names corpus to guess the gender of a name." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "import random\n\nimport nltk\nfrom nltk.corpus import names", | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "names.words()[:5]", | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"execution_count": 2, | |
"data": { | |
"text/plain": "['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi']" | |
}, | |
"output_type": "execute_result", | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**A feature recognition function**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "def gender_features(word):\n    return {'last_letter': word[-1]}\n\ngender_features('Samantha')", | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"execution_count": 3, | |
"data": { | |
"text/plain": "{'last_letter': 'a'}" | |
}, | |
"output_type": "execute_result", | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Create name datasets**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "def create_name_data():\n    male_names = [(name, 'male') for name in names.words('male.txt')]\n    female_names = [(name, 'female') for name in names.words('female.txt')]\n    allnames = male_names + female_names\n\n    # Shuffle so male and female names are mixed and no longer in alphabetical order\n    random.shuffle(allnames)\n    return allnames\n\nnames_data = create_name_data()", | |
"execution_count": 4, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Make Training, Development, and Test Data Sets**\n\nWe need a development set to try our features on before evaluating on the real test set, so we divide the data three ways. The function below also divides the original items in the same way as the feature sets, so we can keep track of the names." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "# This function allows experimentation with different feature definitions.\n# items is a list of (key, value) pairs from which features are extracted and training sets are made.\n# Each feature set returned is a list of (feature dictionary, value) pairs.\n\n# This function also optionally returns the names of the training, development,\n# and test data for the purposes of error checking.\n\ndef create_training_sets(feature_function, items, return_items=False):\n    # Create the feature sets. Call the function that was passed in.\n    # For names data, key is the name, and value is the gender\n    featuresets = [(feature_function(key), value) for (key, value) in items]\n\n    # Divide the data into thirds. Could divide in other proportions instead.\n    third = len(featuresets) // 3\n\n    train_set, dev_set, test_set = featuresets[0:third], featuresets[third:third*2], featuresets[third*2:]\n    train_items, dev_items, test_items = items[0:third], items[third:third*2], items[third*2:]\n    if return_items:\n        return train_set, dev_set, test_set, train_items, dev_items, test_items\n    else:\n        return train_set, dev_set, test_set", | |
"execution_count": 5, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Train the NLTK classifier on the training data, with the first definition of features**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "# pass in a function name\ntrain_set, dev_set, test_set = create_training_sets(gender_features, names_data)\ncl = nltk.NaiveBayesClassifier.train(train_set)", | |
"execution_count": 6, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Test the classifier on some examples**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "print (\"Carl: \" + cl.classify(gender_features('Carl')))\nprint (\"Carla: \" + cl.classify(gender_features('Carla')))\nprint (\"Carly: \" + cl.classify(gender_features('Carly')))\nprint (\"Carlo: \" + cl.classify(gender_features('Carlo')))\nprint (\"Carlos: \" + cl.classify(gender_features('Carlos')))\n", | |
"execution_count": 7, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "Carl: female\nCarla: female\nCarly: female\nCarlo: male\nCarlos: male\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "print (\"Carli: \" + cl.classify(gender_features('Carli')))\nprint (\"Carle: \" + cl.classify(gender_features('Carle')))\nprint (\"Charles: \" + cl.classify(gender_features('Charles')))\nprint (\"Carlie: \" + cl.classify(gender_features('Carlie')))\nprint (\"Charlie: \" + cl.classify(gender_features('Charlie')))", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "Carli: female\nCarle: female\nCharles: male\nCarlie: female\nCharlie: female\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Run the NLTK evaluation function on the development set**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "print (\"%.3f\" % nltk.classify.accuracy(cl, dev_set))", | |
"execution_count": 9, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "0.757\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Run the NLTK feature inspection function on the classifier**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "cl.show_most_informative_features(15)", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "Most Informative Features\n last_letter = 'a' female : male = 29.5 : 1.0\n last_letter = 'k' male : female = 28.3 : 1.0\n last_letter = 'v' male : female = 10.5 : 1.0\n last_letter = 'd' male : female = 9.7 : 1.0\n last_letter = 'm' male : female = 9.3 : 1.0\n last_letter = 'r' male : female = 9.1 : 1.0\n last_letter = 'p' male : female = 7.2 : 1.0\n last_letter = 'o' male : female = 6.5 : 1.0\n last_letter = 'w' male : female = 6.1 : 1.0\n last_letter = 'x' male : female = 5.0 : 1.0\n last_letter = 't' male : female = 4.5 : 1.0\n last_letter = 'i' female : male = 4.5 : 1.0\n last_letter = 's' male : female = 4.3 : 1.0\n last_letter = 'z' male : female = 3.9 : 1.0\n last_letter = 'g' male : female = 3.5 : 1.0\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Let's add some more features to improve results**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "def vowel_count(name):\n    # Count the lowercase vowels in name\n    count = 0\n    for letter in name:\n        if letter in 'aeiou':\n            count += 1\n    return count", | |
"execution_count": 11, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "'samantha'.count('a')", | |
"execution_count": 12, | |
"outputs": [ | |
{ | |
"execution_count": 12, | |
"data": { | |
"text/plain": "3" | |
}, | |
"output_type": "execute_result", | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "def gender_features2(word):\n    features = {}\n    word = word.lower()\n    features['last'] = word[-1]\n    features['first'] = word[:1]\n    features['second'] = word[1:2]  # get the 'h' in Charlie?\n    return features\n\ngender_features2('Samantha')", | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"execution_count": 13, | |
"data": { | |
"text/plain": "{'first': 's', 'last': 'a', 'second': 'a'}" | |
}, | |
"output_type": "execute_result", | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**We wrote the code so that we can easily pass in the new feature function. Let's see if this improves the results on the development set.**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "train_set2, dev_set2, test_set2 = create_training_sets(gender_features2, names_data)\ncl2 = nltk.NaiveBayesClassifier.train(train_set2)\nprint (\"%.3f\" % nltk.classify.accuracy(cl2, dev_set2))", | |
"execution_count": 14, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "0.768\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "cl2.show_most_informative_features(15)", | |
"execution_count": 15, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "Most Informative Features\n last = 'a' female : male = 29.5 : 1.0\n last = 'k' male : female = 28.3 : 1.0\n last = 'v' male : female = 10.5 : 1.0\n last = 'd' male : female = 9.7 : 1.0\n last = 'm' male : female = 9.3 : 1.0\n last = 'r' male : female = 9.1 : 1.0\n last = 'p' male : female = 7.2 : 1.0\n last = 'o' male : female = 6.5 : 1.0\n last = 'w' male : female = 6.1 : 1.0\n second = 'c' male : female = 5.0 : 1.0\n last = 'x' male : female = 5.0 : 1.0\n first = 'w' male : female = 4.8 : 1.0\n last = 't' male : female = 4.5 : 1.0\n last = 'i' female : male = 4.5 : 1.0\n last = 's' male : female = 4.3 : 1.0\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Let's hand-check some of the harder cases... oops, some are right but some are now wrong.**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "# cl2 was trained on gender_features2 features, so classify with the same feature function\nprint (\"Carli: \" + cl2.classify(gender_features2('Carli')))\nprint (\"Carle: \" + cl2.classify(gender_features2('Carle')))\nprint (\"Charles: \" + cl2.classify(gender_features2('Charles')))\nprint (\"Carlie: \" + cl2.classify(gender_features2('Carlie')))\nprint (\"Charlie: \" + cl2.classify(gender_features2('Charlie')))", | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**We can see the influence of some of the new features**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "cl2.show_most_informative_features(15)", | |
"execution_count": 17, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "Most Informative Features\n last = 'a' female : male = 29.5 : 1.0\n last = 'k' male : female = 28.3 : 1.0\n last = 'v' male : female = 10.5 : 1.0\n last = 'd' male : female = 9.7 : 1.0\n last = 'm' male : female = 9.3 : 1.0\n last = 'r' male : female = 9.1 : 1.0\n last = 'p' male : female = 7.2 : 1.0\n last = 'o' male : female = 6.5 : 1.0\n last = 'w' male : female = 6.1 : 1.0\n second = 'c' male : female = 5.0 : 1.0\n last = 'x' male : female = 5.0 : 1.0\n first = 'w' male : female = 4.8 : 1.0\n last = 't' male : female = 4.5 : 1.0\n last = 'i' female : male = 4.5 : 1.0\n last = 's' male : female = 4.3 : 1.0\n" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Below we use code from the NLTK chapter to collect the development-set names that the classifier guessed wrong, so we can inspect them. We use the option of the training set function that also returns the original names in the training, development, and test sets.**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "train_set3, dev_set3, test_set3, train_items, dev_items, test_items = create_training_sets(gender_features2, names_data, True)\ncl3 = nltk.NaiveBayesClassifier.train(train_set3)\n\n# This is code from the NLTK chapter\nerrors = []\nfor (name, label) in dev_items:\n    print(str(name) + \" \" + str(label))\n    guess = cl3.classify(gender_features2(name))\n    if guess != label:\n        errors.append((label, guess, name))", | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Print out the correct vs. the guessed answer for the errors, in order to inspect those that were wrong.**" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "for (tag, guess, name) in sorted(errors):\n    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))", | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**Exercise** Rewrite the feature function above to add some additional features, and then rerun the classifier on the development set to evaluate whether it improves or degrades results. Check the results on the dev items to see where you still make errors and add or remove features. When you are satisfied with the results, *freeze your algorithm*, **run it one time only on the test collection**, and report the results with the evaluation function.\n\nIdeas for features:\n* name length\n* pairs of characters\n* your idea goes here" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "def gender_features3(word):\n    features = {}\n    word = word.lower()\n    features['last'] = word[-1]\n    features['last_two'] = word[-2:]\n    features['last_three'] = word[-3:]\n    features['first'] = word[0]\n    features['first_two'] = word[:2]\n    features['first_three'] = word[:3]\n    features['first_four'] = word[:4]\n    features['length'] = len(word)\n    return features", | |
"execution_count": 18, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "train_set3, dev_set3, test_set3 = create_training_sets(gender_features3, names_data)\ncl3 = nltk.NaiveBayesClassifier.train(train_set3)\nprint (\"%.3f\" % nltk.classify.accuracy(cl3, dev_set3))", | |
"execution_count": 19, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "0.834\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "print (\"%.3f\" % nltk.classify.accuracy(cl3, test_set3))", | |
"execution_count": 20, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "0.830\n" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"collapsed": true, | |
"trusted": false | |
}, | |
"cell_type": "code", | |
"source": "", | |
"execution_count": null, | |
"outputs": [] | |
} | |
], | |
"metadata": { | |
"language_info": { | |
"pygments_lexer": "ipython3", | |
"name": "python", | |
"codemirror_mode": { | |
"version": 3, | |
"name": "ipython" | |
}, | |
"version": "3.4.2", | |
"mimetype": "text/x-python", | |
"nbconvert_exporter": "python", | |
"file_extension": ".py" | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
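The exercise in the notebook lists "pairs of characters" as a feature idea but does not show one. The following is a minimal sketch of how such a feature function might look; the name `gender_features_bigrams` and the `has(xy)` feature keys are assumptions made for illustration and are not part of the notebook. It uses plain Python only, so it runs without NLTK.

```python
def gender_features_bigrams(word):
    """Hypothetical feature function: first and last letter, length, and character pairs."""
    word = word.lower()
    features = {
        'last': word[-1],
        'first': word[:1],
        'length': len(word),
    }
    # One boolean feature per adjacent pair of characters in the name
    for left, right in zip(word, word[1:]):
        features['has(%s%s)' % (left, right)] = True
    return features


print(gender_features_bigrams('Carla'))
```

Because it takes a name and returns a dictionary, it could be passed to the notebook's `create_training_sets` in the same way as `gender_features2`, and the resulting classifier checked on the development set before touching the test set.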