@fayeip
Last active August 29, 2015 14:08
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "code",
"input": "import nltk\nfrom nltk.corpus import names\nimport random",
"prompt_number": 41,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** A feature recognition function **"
},
{
"metadata": {},
"cell_type": "code",
"input": "def gender_features(word):\n return {'last_letter': word[-1]}\ngender_features('Samantha')",
"prompt_number": 42,
"outputs": [
{
"text": "{'last_letter': 'a'}",
"output_type": "pyout",
"metadata": {},
"prompt_number": 42
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Create name datasets ** "
},
{
"metadata": {},
"cell_type": "code",
"input": "def create_name_data():\n male_names = [(name, 'male') for name in names.words('male.txt')]\n female_names = [(name, 'female') for name in names.words('female.txt')]\n allnames = male_names + female_names\n \n # Randomize the order of male and female names, and de-alphabatize\n random.shuffle(allnames)\n return allnames\n\nnames_data = create_name_data()",
"prompt_number": 43,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** First Pass at Training and Testing Data **"
},
{
"metadata": {},
"cell_type": "code",
"input": "\n# This function allows experimentation with different feature definitions\n# items is a list of (key, value) pairs from which features are extracted and training sets are made\ndef create_training_sets (feature_function, items):\n # Create the features sets. Call the function that was passed in.\n # For names, key is the name, and value is the gender\n featuresets = [(feature_function(key), value) for (key, value) in items]\n \n # Divided training and testing in half. Could divide in other proportions instead.\n halfsize = int(float(len(featuresets)) / 2.0)\n train_set, test_set = featuresets[halfsize:], featuresets[:halfsize]\n return train_set, test_set",
"prompt_number": 44,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Train the classifier on the training data, with the first definition of features **"
},
{
"metadata": {},
"cell_type": "code",
"input": "# pass in a function name\ntrain_set, test_set = create_training_sets(gender_features, names_data)\ncl = nltk.NaiveBayesClassifier.train(train_set)",
"prompt_number": 45,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Test the classifier on some examples **"
},
{
"metadata": {},
"cell_type": "code",
"input": "print cl.classify(gender_features('Carl'))\nprint cl.classify(gender_features('Carla'))\nprint cl.classify(gender_features('Carly'))\nprint cl.classify(gender_features('Carlo'))\nprint cl.classify(gender_features('Carlos'))\n",
"prompt_number": 46,
"outputs": [
{
"output_type": "stream",
"text": "female\nfemale\nfemale\nmale\nmale\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print cl.classify(gender_features('Carli'))\nprint cl.classify(gender_features('Carle'))\nprint cl.classify(gender_features('Charles'))\nprint cl.classify(gender_features('Carlie'))\nprint cl.classify(gender_features('Charlie'))",
"prompt_number": 47,
"outputs": [
{
"output_type": "stream",
"text": "female\nfemale\nmale\nfemale\nfemale\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Run the NLTK evaluation function on the test set **"
},
{
"metadata": {},
"cell_type": "code",
"input": "print \"%.3f\" % nltk.classify.accuracy(cl, test_set)",
"prompt_number": 48,
"outputs": [
{
"output_type": "stream",
"text": "0.762\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Run the NLTK feature inspection function on the classifier **"
},
{
"metadata": {},
"cell_type": "code",
"input": "cl.show_most_informative_features(15)",
"prompt_number": 49,
"outputs": [
{
"output_type": "stream",
"text": "Most Informative Features\n last_letter = 'k' male : female = 30.1 : 1.0\n last_letter = 'a' female : male = 27.8 : 1.0\n last_letter = 'm' male : female = 14.5 : 1.0\n last_letter = 'f' male : female = 14.1 : 1.0\n last_letter = 'p' male : female = 10.5 : 1.0\n last_letter = 'd' male : female = 9.2 : 1.0\n last_letter = 'v' male : female = 8.5 : 1.0\n last_letter = 'o' male : female = 7.8 : 1.0\n last_letter = 'u' male : female = 5.8 : 1.0\n last_letter = 'r' male : female = 5.7 : 1.0\n last_letter = 's' male : female = 5.2 : 1.0\n last_letter = 'w' male : female = 4.4 : 1.0\n last_letter = 'i' female : male = 4.3 : 1.0\n last_letter = 't' male : female = 4.2 : 1.0\n last_letter = 'g' male : female = 3.5 : 1.0\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Let's add some more features to improve results **"
},
{
"metadata": {},
"cell_type": "code",
"input": "def gender_features2(word):\n features = {}\n word = word.lower()\n features['last'] = word[-1]\n features['first'] = word[:1]\n features['second'] = word[1:2] # get the 'h' in Charlie?\n return features\ngender_features2('Samantha') ",
"prompt_number": 50,
"outputs": [
{
"text": "{'first': 's', 'last': 'a', 'second': 'a'}",
"output_type": "pyout",
"metadata": {},
"prompt_number": 50
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** We wrote the code so that we can easily pass in the new feature function. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "train_set2, test_set2 = create_training_sets(gender_features2, names_data)\ncl2 = nltk.NaiveBayesClassifier.train(train_set2)\nprint \"%.3f\" % nltk.classify.accuracy(cl2, test_set2)",
"prompt_number": 51,
"outputs": [
{
"output_type": "stream",
"text": "0.771\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Let's hand check some of the harder cases ... oops some are right but some are now wrong. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "print cl2.classify(gender_features2('Carli'))\nprint cl2.classify(gender_features2('Carle'))\nprint cl2.classify(gender_features2('Charles')) \nprint cl2.classify(gender_features2('Carlie'))\nprint cl2.classify(gender_features2('Charlie'))",
"prompt_number": 52,
"outputs": [
{
"output_type": "stream",
"text": "female\nfemale\nmale\nfemale\nfemale\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** We can see the influence of some of the new features **"
},
{
"metadata": {},
"cell_type": "code",
"input": "cl2.show_most_informative_features(15)",
"prompt_number": 53,
"outputs": [
{
"output_type": "stream",
"text": "Most Informative Features\n last = 'k' male : female = 30.1 : 1.0\n last = 'a' female : male = 27.8 : 1.0\n last = 'm' male : female = 14.5 : 1.0\n last = 'f' male : female = 14.1 : 1.0\n last = 'p' male : female = 10.5 : 1.0\n last = 'd' male : female = 9.2 : 1.0\n last = 'v' male : female = 8.5 : 1.0\n last = 'o' male : female = 7.8 : 1.0\n second = 'k' male : female = 7.3 : 1.0\n second = 'c' male : female = 5.8 : 1.0\n last = 'u' male : female = 5.8 : 1.0\n last = 'r' male : female = 5.7 : 1.0\n last = 's' male : female = 5.2 : 1.0\n first = 'q' male : female = 5.1 : 1.0\n first = 'w' male : female = 4.9 : 1.0\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** We really need a development set to test our features on before testing on the real test set. So let's redo our division of the data. In this case we do the dividing up before applying the feature selection so we can keep track of the names. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "def create_training_sets3 (feature_function, items):\n # Create the features sets. Call the function that was passed in.\n # For names, key is the name, and value is the gender\n featuresets = [(feature_function(key), value) for (key, value) in items]\n \n # Divide data into thirds\n third = int(float(len(featuresets)) / 3.0)\n return items[0:third], items[third:third*2], items[third*2:], featuresets[0:third], featuresets[third:third*2], featuresets[third*2:]\n \ntrain_items, dev_items, test_items, train_features, dev_features, test_features = create_training_sets3(gender_features2, names_data)\n",
"prompt_number": 54,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "cl3 = nltk.NaiveBayesClassifier.train(train_features)\n# This is code from the NLTK chapter\nerrors = []\nfor (name, tag) in dev_items:\n guess = cl3.classify(gender_features2(name))\n if guess != tag:\n errors.append( (tag, guess, name) )",
"prompt_number": 55,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Print out the correct vs. the guessed answer for the errors, in order to inspect those that were wrong. **"
},
{
"metadata": {},
"cell_type": "code",
"input": "# for (tag, guess, name) in sorted(errors): \n# print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)",
"prompt_number": 56,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "** Exercise** Rewrite the functions above to add some additional features, and then rerun the classifier to evaluate if they improve or degrade results. But don't overfit!\n\nIdeas for features:\n* name length\n* pairs of characters\n* your idea goes here"
},
{
"metadata": {},
"cell_type": "code",
"input": "def gender_features4(word):\n features = {}\n word = word.lower()\n features['last'] = word[-1]\n features['lasttwo'] = word[-2:]\n features['lastthree'] = word[-3:]\n features['first'] = word[0]\n return features\ngender_features4('Samantha') \n",
"prompt_number": 57,
"outputs": [
{
"text": "{'first': 's', 'last': 'a', 'lastthree': 'tha', 'lasttwo': 'ha'}",
"output_type": "pyout",
"metadata": {},
"prompt_number": 57
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "def create_training_sets4 (feature_function, items):\n # Create the features sets. Call the function that was passed in.\n # For names, key is the name, and value is the gender\n featuresets = [(feature_function(key), value) for (key, value) in items]\n \n # Divide data into thirds\n third = int(float(len(featuresets)) / 3.0)\n return items[0:third], items[third:third*2], items[third*2:], featuresets[0:third], featuresets[third:third*2], featuresets[third*2:]\n \ntrain_items, dev_items, test_items, train_features, dev_features, test_features = create_training_sets4(gender_features4, names_data)\n",
"prompt_number": 58,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "cl4 = nltk.NaiveBayesClassifier.train(train_features)\nprint \"%.3f\" % nltk.classify.accuracy(cl4, dev_features)",
"prompt_number": 59,
"outputs": [
{
"output_type": "stream",
"text": "0.807\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "cl4.show_most_informative_features(15) #looks like my features are much better for predicting female names than male",
"prompt_number": 60,
"outputs": [
{
"output_type": "stream",
"text": "Most Informative Features\n last = 'a' female : male = 54.3 : 1.0\n lasttwo = 'ta' female : male = 29.9 : 1.0\n lasttwo = 'rd' male : female = 28.6 : 1.0\n lasttwo = 'ra' female : male = 25.2 : 1.0\n lasttwo = 'la' female : male = 24.4 : 1.0\n lastthree = 'ard' male : female = 19.9 : 1.0\n lastthree = 'son' male : female = 14.8 : 1.0\n lasttwo = 'ar' male : female = 11.8 : 1.0\n lastthree = 'ter' male : female = 11.7 : 1.0\n last = 'f' male : female = 11.1 : 1.0\n lasttwo = 'ro' male : female = 10.6 : 1.0\n last = 'o' male : female = 10.6 : 1.0\n last = 'd' male : female = 10.5 : 1.0\n lastthree = 'ita' female : male = 10.3 : 1.0\n lasttwo = 'os' male : female = 9.5 : 1.0\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "errors = []\nfor (name, tag) in dev_items:\n guess = cl4.classify(gender_features4(name))\n if guess != tag:\n errors.append( (tag, guess, name) )\n\nfor (tag, guess, name) in sorted(errors)[0:20]: \n print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)",
"prompt_number": 61,
"outputs": [
{
"output_type": "stream",
"text": "correct=female guess=male name=Adelind \ncorrect=female guess=male name=Adrian \ncorrect=female guess=male name=Ag \ncorrect=female guess=male name=Agnes \ncorrect=female guess=male name=Aigneis \ncorrect=female guess=male name=Ainsley \ncorrect=female guess=male name=Alex \ncorrect=female guess=male name=Allyson \ncorrect=female guess=male name=Amargo \ncorrect=female guess=male name=Amber \ncorrect=female guess=male name=Andromache \ncorrect=female guess=male name=Anett \ncorrect=female guess=male name=Annabell \ncorrect=female guess=male name=Ansley \ncorrect=female guess=male name=April \ncorrect=female guess=male name=Ashley \ncorrect=female guess=male name=Astrid \ncorrect=female guess=male name=Babs \ncorrect=female guess=male name=Bamby \ncorrect=female guess=male name=Barry \n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:dab819b8c358935ef236ef2b7a54e11fbfd9956d52defe651594734e152f3ce3",
"gist_id": "54e2f31d85080da3302e"
},
"nbformat": 3
}