{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Binary classification with perceptron" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This notebook is available at: https://gist.github.com/chokkan/a962dbf16f070bd0df5d182040a39f3b" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Toy example\n", | |
"\n", | |
"In this notebook, a feature vector $\\phi(x)$ of an instance $x$ is represented by a mapping (dictionary) object `x` whose keys are feature names and values are feature values. In other words, feature indices are represented by any strings rather than by integers. For example,\n", | |
"\n", | |
"```\n", | |
"x = {}\n", | |
"x['darling'] = 1\n", | |
"x['photo'] = 1\n", | |
"x['attach'] = 1\n", | |
"```\n", | |
"\n", | |
"This representation is useful because the feature space for a natural language is high dimensional and sparse. If we define a feature space as occurrences of every word, \n", | |
"\n", | |
"* the number of the dimension of the feature space ($d$) equals to the total number of words in the language, which typically amounts to 1M words.\n", | |
"* although a feature vector is represented by $d$-dimensional vector, most elements in the vector are zero; only a limited number of elements corresponding to the word in a sentence have non-zero values.\n", | |
"\n", | |
"A binary label $y$ is either `+1` (positive) or `-1` (negative)." | |
] | |
}, | |
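{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a quick illustration of the sparsity (a hypothetical snippet, not part of the lecture example), the mapping object stores only the non-zero coordinates of the $d$-dimensional feature vector:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Hypothetical illustration: only non-zero features are stored explicitly.\n", | |
"x = {'darling': 1, 'photo': 1, 'attach': 1}\n", | |
"d = 1000000   # assumed vocabulary size (~1M words)\n", | |
"len(x), d     # 3 entries stored instead of 1,000,000" | |
] | |
}, | |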
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import collections\n", | |
"import functools\n", | |
"import math\n", | |
"import operator\n", | |
"import random" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is the example used in the lecture, $[(x_1, y_1), (x_2, y_2)]$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Hi darling, my photo in attached file\n", | |
"x1 = {'@bias': 1, 'hi_darl':1, 'darl_my':1, 'my_photo':1, 'photo_attach':1, 'attach_file':1}\n", | |
"y1 = +1\n", | |
"\n", | |
"# Hi Mark, Kyoto photo in attached file\n", | |
"x2 = {'@bias': 1, 'hi_mark':1, 'mark_kyoto':1, 'kyoto_photo':1, 'photo_attach':1, 'attach_file':1}\n", | |
"y2 = -1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1,\n", | |
" 'attach_file': 1,\n", | |
" 'darl_my': 1,\n", | |
" 'hi_darl': 1,\n", | |
" 'my_photo': 1,\n", | |
" 'photo_attach': 1}" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1,\n", | |
" 'attach_file': 1,\n", | |
" 'hi_mark': 1,\n", | |
" 'kyoto_photo': 1,\n", | |
" 'mark_kyoto': 1,\n", | |
" 'photo_attach': 1}" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Perceptron" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is an implementation of the perceptron algorithm." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def dot_product(w, x):\n", | |
" \"\"\"Inner product, w \\cdot x.\n", | |
" \n", | |
" Args:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
" x: feature vector as a mapping object: feature -> value.\n", | |
" Returns:\n", | |
" the inner product, w \\cdot x.\n", | |
"\n", | |
" \"\"\"\n", | |
"\n", | |
" a = 0.\n", | |
" for f, v in x.iteritems():\n", | |
" a += w.get(f, 0.) * v\n", | |
" return a\n", | |
"\n", | |
"def predict(w, x):\n", | |
" \"\"\"Predict the label of an instance.\n", | |
" \n", | |
" Args:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
" x: feature vector as a mapping object: feature -> value.\n", | |
" Returns:\n", | |
" the predicted label: +1 (true) or -1 (false).\n", | |
" \"\"\"\n", | |
" return +1 if dot_product(w, x) >= 0 else -1 \n", | |
"\n", | |
"def update_perceptron(w, x, y):\n", | |
" \"\"\"Update the model with a training instance (x, y).\n", | |
" \n", | |
" Args:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
" x: feature vector of the training instance as a mapping object.\n", | |
" y: label of the training instance, -1 or +1.\n", | |
"\n", | |
" \"\"\"\n", | |
" yp = predict(w, x)\n", | |
" if yp * y < 0:\n", | |
" for f, v in x.iteritems():\n", | |
" w[f] += y * v" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Initialization\n", | |
"We represent the weight vector (model) with a special dictionary object that automatically sets missing values to zero (`collections.defaultdict`). The initial model is empty (no feature weight)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"defaultdict(float, {})" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"w = collections.defaultdict(float)\n", | |
"w" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iteration #1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Predict the label $\\hat{y_1}$ for the instance $x_1$. This is a correct prediction ($\\hat{y_1} = y_1 = +1$) because we assume an inner product of zero as a positive label." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iteration #2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Then, predict the label $\\hat{y_2}$ for the instance $x_2$. This is an incorrect prediction ($\\hat{y_2} \\neq y_2 = -1$)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.0" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Because the model could not predict the instance $(x_2, y_2)$ correctly, update the feature weights." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"update_perceptron(w, x2, y2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check the current feature weights $w$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"defaultdict(float,\n", | |
" {'@bias': -1.0,\n", | |
" 'attach_file': -1.0,\n", | |
" 'hi_mark': -1.0,\n", | |
" 'kyoto_photo': -1.0,\n", | |
" 'mark_kyoto': -1.0,\n", | |
" 'photo_attach': -1.0})" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"w" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can confirm that the updated weights can predict the instance $(x_2, y_2)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x2) == y2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-6.0" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iterations #3" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Predict the label $\\hat{y_1}$ for the instance $x_1$. This is an incorrect prediction ($\\hat{y_1} \\neq y_1 = +1$)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-1" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-3.0" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Because the model could not predict the instance $(x_1, y_1)$ correctly, update the feature weights." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"update_perceptron(w, x1, y1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check the current feature weights $w$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"defaultdict(float,\n", | |
" {'@bias': 0.0,\n", | |
" 'attach_file': 0.0,\n", | |
" 'darl_my': 1.0,\n", | |
" 'hi_darl': 1.0,\n", | |
" 'hi_mark': -1.0,\n", | |
" 'kyoto_photo': -1.0,\n", | |
" 'mark_kyoto': -1.0,\n", | |
" 'my_photo': 1.0,\n", | |
" 'photo_attach': 0.0})" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"w" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can confirm that the updated weights can predict the instance $(x_1, y_1)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1) == y1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"3.0" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iterations #4\n", | |
"\n", | |
"We can confirm that the current feature weights can can predict the instance $(x_2, y_2)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x2) == y2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iterations #5\n", | |
"\n", | |
"We can confirm that the current feature weights can can predict the instance $(x_1, y_1)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1) == y1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Sentiment analysis" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us build a sentiment predictor (positive/negative) by using [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) distributed in [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Preparing the data set" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We download the dataset and prepare the training data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2016-10-14 11:23:54-- http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\n", | |
"Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20\n", | |
"Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 487770 (476K) [application/x-gzip]\n", | |
"Saving to: ‘rt-polaritydata.tar.gz’\n", | |
"\n", | |
"100%[======================================>] 487,770 436KB/s in 1.1s \n", | |
"\n", | |
"2016-10-14 11:23:56 (436 KB/s) - ‘rt-polaritydata.tar.gz’ saved [487770/487770]\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"rt-polaritydata.README.1.0.txt\n", | |
"rt-polaritydata/rt-polarity.neg\n", | |
"rt-polaritydata/rt-polarity.pos\n" | |
] | |
} | |
], | |
"source": [ | |
"!tar xvzf rt-polaritydata.tar.gz" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us check the training instances in the tar-ball." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \r\n", | |
"the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . \r\n", | |
"effective but too-tepid biopic\r\n", | |
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \r\n", | |
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \r\n" | |
] | |
} | |
], | |
"source": [ | |
"!head -n5 rt-polaritydata/rt-polarity.pos" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"simplistic , silly and tedious . \r\n", | |
"it's so laddish and juvenile , only teenage boys could possibly find it funny . \r\n", | |
"exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \r\n", | |
"[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . \r\n", | |
"a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . \r\n" | |
] | |
} | |
], | |
"source": [ | |
"!head -n5 rt-polaritydata/rt-polarity.neg" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Merge positive and negative instances after inserting '+1' at the beginning of each line in the positive data and '-1' that in the negative data. Sort the order of the instances at random." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!sed \"s/^/+1 /g\" rt-polaritydata/rt-polarity.pos > positives.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!sed \"s/^/-1 /g\" rt-polaritydata/rt-polarity.neg > negatives.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!sort -R positives.txt negatives.txt > data.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"-1 schmaltzy and unfunny , adam sandler's cartoon about hanukkah is numbingly bad , little nicky bad , 10 worst list bad . \r\n", | |
"-1 although there are several truly jolting scares , there's also an abundance of hackneyed dialogue and more silly satanic business than you can shake a severed limb at . \r\n", | |
"-1 fails to satisfactorily exploit its gender politics , genre thrills or inherent humor . \r\n", | |
"-1 it almost feels as if the movie is more interested in entertaining itself than in amusing us . \r\n", | |
"-1 it's the type of stunt the academy loves : a powerful political message stuffed into an otherwise mediocre film . \r\n", | |
"-1 broder's screenplay is shallow , offensive and redundant , with pitifully few real laughs . \r\n", | |
"+1 grant gets to display his cadness to perfection , but also to show acting range that may surprise some who thought light-hearted comedy was his forte . \r\n", | |
"+1 as ex-marine walter , who may or may not have shot kennedy , actor raymond j . barry is perfectly creepy and believable . \r\n", | |
"-1 it wouldn't be my preferred way of spending 100 minutes or $7 . 00 . \r\n", | |
"-1 the picture doesn't know it's a comedy . \r\n" | |
] | |
} | |
], | |
"source": [ | |
"!head -n 10 data.txt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Count the number of positive and negative instances." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"5331\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!grep '^+1' data.txt | wc -l" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"5331\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!grep '^-1' data.txt | wc -l" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Implementing a feature extractor\n", | |
"\n", | |
"We implement a feature extractor which converts a text into a sparse vector. We use a stop list distributed on the Web." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2016-10-14 11:23:58-- http://www.textfixer.com/resources/common-english-words.txt\n", | |
"Resolving www.textfixer.com (www.textfixer.com)... 216.172.105.85\n", | |
"Connecting to www.textfixer.com (www.textfixer.com)|216.172.105.85|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 551 [text/plain]\n", | |
"Saving to: ‘common-english-words.txt’\n", | |
"\n", | |
"100%[======================================>] 551 --.-K/s in 0s \n", | |
"\n", | |
"2016-10-14 11:23:58 (54.7 MB/s) - ‘common-english-words.txt’ saved [551/551]\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"!wget http://www.textfixer.com/resources/common-english-words.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your" | |
] | |
} | |
], | |
"source": [ | |
"!cat common-english-words.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from stemming.porter2 import stem\n", | |
"\n", | |
"stoplist = set(open('common-english-words.txt').read().split(','))\n", | |
"\n", | |
"def is_non_stop(x):\n", | |
" return x not in stoplist\n", | |
"\n", | |
"def has_alnum(x):\n", | |
" return any((c.isalnum() for c in x))\n", | |
"\n", | |
"def feature(s):\n", | |
" \"\"\"Feature extractor (from a sequence of words).\n", | |
" \n", | |
" Args:\n", | |
" s: a list of words in a sentence.\n", | |
" Returns:\n", | |
" feature vector as a mapping object: feature -> value.\n", | |
" \n", | |
" \"\"\"\n", | |
" # Remove stop words (find words x \\in s where is_non_stop(x) is True)\n", | |
" x = filter(is_non_stop, s)\n", | |
" # Apply stemming (apply stem(i) for all i \\in x)\n", | |
" x = map(stem, x)\n", | |
" # Remove non alphanumeric words.\n", | |
" x = filter(has_alnum, x)\n", | |
" # Append the bias feature\n", | |
" x.append('@bias')\n", | |
" # Unigram features (the number of occurrences of each word)\n", | |
" return dict(collections.Counter(x))\n", | |
"\n", | |
"def T2F(text):\n", | |
" \"\"\"Feature extractor (from a natural sentence).\n", | |
" \n", | |
" Args:\n", | |
" text: a sentence.\n", | |
" Returns:\n", | |
" feature vector as a mapping object: feature -> value.\n", | |
" \n", | |
" \"\"\"\n", | |
" return feature(text.lower().split(' '))" | |
] | |
}, | |
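{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To see what each stage of `feature` does, here is an illustrative breakdown on one sentence (a hypothetical cell, not in the original notebook); recall that `filter` and `map` return lists in Python 2:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Hypothetical walkthrough of the pipeline inside feature().\n", | |
"s = 'simplistic , silly and tedious .'.split(' ')\n", | |
"s = filter(is_non_stop, s)   # removes the stop word 'and'\n", | |
"s = map(stem, s)             # applies the Porter2 stemmer\n", | |
"s = filter(has_alnum, s)     # drops punctuation-only tokens\n", | |
"s                            # ['simplist', 'silli', 'tedious']" | |
] | |
}, | |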
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us check the feature extractor." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1, 'silli': 1, 'simplist': 1, 'tedious': 1}" | |
] | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"T2F('simplistic , silly and tedious .')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1,\n", | |
" 'boy': 1,\n", | |
" 'find': 1,\n", | |
" 'funni': 1,\n", | |
" 'it': 1,\n", | |
" 'juvenil': 1,\n", | |
" 'laddish': 1,\n", | |
" 'possibl': 1,\n", | |
" 'teenag': 1}" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"T2F(\"it's so laddish and juvenile , only teenage boys could possibly find it funny . \")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Load the data set" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Read the instances in `data.txt` and store each instance in an `Instance` object." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"class Instance:\n", | |
" def __init__(self, x, y, text):\n", | |
" self.x = x\n", | |
" self.y = y\n", | |
" self.text = text\n", | |
" def __repr__(self):\n", | |
" return repr((self.y, self.x))\n", | |
"\n", | |
"D = []\n", | |
"for line in open('data.txt'):\n", | |
" pos = line.find(' ')\n", | |
" if pos == -1:\n", | |
" continue\n", | |
" y = int(line[:pos])\n", | |
" x = T2F(line[pos+1:])\n", | |
" D.append(Instance(x, y, line))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(-1, {'humor': 1, 'satisfactorili': 1, 'gender': 1, 'genr': 1, 'exploit': 1, '@bias': 1, 'inher': 1, 'fail': 1, 'polit': 1, 'thrill': 1})" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"D[2]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Training with perceptron" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def training_with_perceptron(D, max_iterations=10):\n", | |
" \"\"\"Training a linear binary classifier with perceptron.\n", | |
" \n", | |
" Args:\n", | |
" D: training set, a list of Instance objects.\n", | |
" max_iterations: the number of iterations.\n", | |
" Returns:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
"\n", | |
" \"\"\"\n", | |
" w = collections.defaultdict(float)\n", | |
" for epoch in range(max_iterations):\n", | |
" random.shuffle(D) # This lazy implementation alters D.\n", | |
" for d in D:\n", | |
" update_perceptron(w, d.x, d.y)\n", | |
" return w" | |
] | |
}, | |
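{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The implementation above shuffles `D` in place. If mutating the training set is undesirable, one can shuffle a list of indices instead; a minimal sketch of this variant (not in the original notebook):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def training_with_perceptron_pure(D, max_iterations=10):\n", | |
"    \"\"\"Variant that shuffles an index list, leaving D untouched.\"\"\"\n", | |
"    w = collections.defaultdict(float)\n", | |
"    order = list(range(len(D)))\n", | |
"    for epoch in range(max_iterations):\n", | |
"        random.shuffle(order)   # shuffle indices instead of D itself\n", | |
"        for i in order:\n", | |
"            update_perceptron(w, D[i].x, D[i].y)\n", | |
"    return w" | |
] | |
}, | |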
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"w = training_with_perceptron(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-9.0" | |
] | |
}, | |
"execution_count": 41, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, T2F('simplistic , silly and tedious .'))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2.0" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"M = sorted(w.iteritems(), key=operator.itemgetter(1))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('snake', -11.0),\n", | |
" ('well-intent', -10.0),\n", | |
" ('generic', -10.0),\n", | |
" ('portion', -10.0),\n", | |
" (\"wasn't\", -10.0),\n", | |
" ('appar', -10.0),\n", | |
" ('unless', -10.0),\n", | |
" ('random', -10.0),\n", | |
" ('wast', -9.0),\n", | |
" ('hospit', -9.0)]" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"M[:10]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('precis', 9.0),\n", | |
" ('conscienc', 9.0),\n", | |
" ('air-condit', 9.0),\n", | |
" ('smith', 9.0),\n", | |
" ('spider-man', 10.0),\n", | |
" ('surreal', 10.0),\n", | |
" ('refresh', 10.0),\n", | |
" ('smarter', 11.0),\n", | |
" ('engross', 11.0),\n", | |
" ('explod', 11.0)]" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"M[-10:]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Closed evaluation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def predict_instance(w, d):\n", | |
" d.label = +1 if dot_product(w, d.x) > 0 else -1\n", | |
"\n", | |
"def predict_all_instances(w, D):\n", | |
" map(functools.partial(predict_instance, w), D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"predict_all_instances(w, D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-1" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"D[0].label" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def num_correct_predictions(D):\n", | |
" return sum(1 for d in D if d.y == d.label)\n", | |
"\n", | |
"def num_true_positives(D):\n", | |
" return sum(1 for d in D if d.y == 1 and d.y == d.label)\n", | |
"\n", | |
"def num_gold_positives(D):\n", | |
" return sum(1 for d in D if d.y == 1)\n", | |
"\n", | |
"def num_predicted_positives(D):\n", | |
" return sum(1 for d in D if d.label == 1)\n", | |
" \n", | |
"def compute_accuracy(D):\n", | |
" return num_correct_predictions(D) / float(len(D))\n", | |
"\n", | |
"def compute_precision(D):\n", | |
" return num_true_positives(D) / float(num_predicted_positives(D))\n", | |
"\n", | |
"def compute_recall(D):\n", | |
" return num_true_positives(D) / float(num_gold_positives(D))\n", | |
"\n", | |
"def compute_f1(D):\n", | |
" p = compute_precision(D)\n", | |
" r = compute_recall(D)\n", | |
" return 2 * p * r / (p + r) if 0 < p + r else 0." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"10197" | |
] | |
}, | |
"execution_count": 50, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"num_correct_predictions(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"10662" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9563871693866066" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_accuracy(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9566441441441441" | |
] | |
}, | |
"execution_count": 53, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_precision(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 54, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9561057962858751" | |
] | |
}, | |
"execution_count": 54, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_recall(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9563748944553898" | |
] | |
}, | |
"execution_count": 55, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_f1(D)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Cross validation (open evaluation)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"N = 10\n", | |
"for n in range(N):\n", | |
" train_set = [D[i] for i in range(len(D)) if i % N != n]\n", | |
" test_set = [D[i] for i in range(len(D)) if i % N == n]\n", | |
" w = training_with_perceptron(train_set)\n", | |
" predict_all_instances(w, test_set)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.7177827799662352" | |
] | |
}, | |
"execution_count": 57, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_accuracy(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(0.7309069212410502, 0.6893640967923467, 0.7095279467130032)" | |
] | |
}, | |
"execution_count": 58, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_precision(D), compute_recall(D), compute_f1(D)" | |
] | |
}, | |
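{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can also inspect the accuracy of each held-out fold separately (an illustrative extension, not in the original notebook):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Illustrative extension: accuracy of each held-out fold.\n", | |
"[compute_accuracy([d for i, d in enumerate(D) if i % N == n]) for n in range(N)]" | |
] | |
}, | |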
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Notice that the huge gap exists between the performances of the closed and open evaluations. " | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |