{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Binary classification with perceptron"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is available at: https://gist.github.com/chokkan/a962dbf16f070bd0df5d182040a39f3b"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Toy example\n",
"\n",
"In this notebook, a feature vector $\\phi(x)$ of an instance $x$ is represented by a mapping (dictionary) object `x` whose keys are feature names and values are feature values. In other words, feature indices are represented by any strings rather than by integers. For example,\n",
"\n",
"```\n",
"x = {}\n",
"x['darling'] = 1\n",
"x['photo'] = 1\n",
"x['attach'] = 1\n",
"```\n",
"\n",
"This representation is useful because the feature space for a natural language is high dimensional and sparse. If we define a feature space as occurrences of every word, \n",
"\n",
"* the number of the dimension of the feature space ($d$) equals to the total number of words in the language, which typically amounts to 1M words.\n",
"* although a feature vector is represented by $d$-dimensional vector, most elements in the vector are zero; only a limited number of elements corresponding to the word in a sentence have non-zero values.\n",
"\n",
"A binary label $y$ is either `+1` (positive) or `-1` (negative)."
]
},
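{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (the weights here are hypothetical), an inner product against such a sparse vector only touches the non-zero entries, no matter how large $d$ is; the `dot_product` function defined later follows the same pattern.\n",
"\n",
"```\n",
"w = {'darling': 0.5, 'photo': -0.2}  # hypothetical feature weights\n",
"x = {'darling': 1, 'photo': 1, 'attach': 1}\n",
"s = sum(w.get(f, 0.) * v for f, v in x.items())  # 0.5 - 0.2 = 0.3\n",
"```"
]
},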
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n",
"import functools\n",
"import math\n",
"import operator\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the example used in the lecture, $[(x_1, y_1), (x_2, y_2)]$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hi darling, my photo in attached file\n",
"x1 = {'@bias': 1, 'hi_darl':1, 'darl_my':1, 'my_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y1 = +1\n",
"\n",
"# Hi Mark, Kyoto photo in attached file\n",
"x2 = {'@bias': 1, 'hi_mark':1, 'mark_kyoto':1, 'kyoto_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y2 = -1"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1,\n",
" 'attach_file': 1,\n",
" 'darl_my': 1,\n",
" 'hi_darl': 1,\n",
" 'my_photo': 1,\n",
" 'photo_attach': 1}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1,\n",
" 'attach_file': 1,\n",
" 'hi_mark': 1,\n",
" 'kyoto_photo': 1,\n",
" 'mark_kyoto': 1,\n",
" 'photo_attach': 1}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perceptron"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an implementation of the perceptron algorithm."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
" \"\"\"Inner product, w \\cdot x.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the inner product, w \\cdot x.\n",
"\n",
" \"\"\"\n",
"\n",
" a = 0.\n",
" for f, v in x.iteritems():\n",
" a += w.get(f, 0.) * v\n",
" return a\n",
"\n",
"def predict(w, x):\n",
" \"\"\"Predict the label of an instance.\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector as a mapping object: feature -> value.\n",
" Returns:\n",
" the predicted label: +1 (true) or -1 (false).\n",
" \"\"\"\n",
" return +1 if dot_product(w, x) >= 0 else -1 \n",
"\n",
"def update_perceptron(w, x, y):\n",
" \"\"\"Update the model with a training instance (x, y).\n",
" \n",
" Args:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
" x: feature vector of the training instance as a mapping object.\n",
" y: label of the training instance, -1 or +1.\n",
"\n",
" \"\"\"\n",
" yp = predict(w, x)\n",
" if yp * y < 0:\n",
" for f, v in x.iteritems():\n",
" w[f] += y * v"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialization\n",
"We represent the weight vector (model) with a special dictionary object that automatically sets missing values to zero (`collections.defaultdict`). The initial model is empty (no feature weight)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
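{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small caveat about `collections.defaultdict` (a minimal sketch for illustration): accessing a missing key with `w[f]` silently inserts that key with value `0.0`, whereas `w.get(f, 0.)` leaves `w` untouched. This is why `dot_product` reads weights with `w.get`, while `update_perceptron` uses `w[f] +=`, which creates new feature weights on the fly.\n",
"\n",
"```\n",
"tmp = collections.defaultdict(float)\n",
"tmp.get('unseen', 0.)  # 0.0, and tmp stays empty\n",
"tmp['unseen']          # 0.0, but 'unseen' is now stored in tmp\n",
"```"
]
},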
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iteration #1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label $\\hat{y_1}$ for the instance $x_1$. This is a correct prediction ($\\hat{y_1} = y_1 = +1$) because we assume an inner product of zero as a positive label."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iteration #2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, predict the label $\\hat{y_2}$ for the instance $x_2$. This is an incorrect prediction ($\\hat{y_2} \\neq y_2 = -1$)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the model could not predict the instance $(x_2, y_2)$ correctly, update the feature weights."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"update_perceptron(w, x2, y2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the current feature weights $w$."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': -1.0,\n",
" 'attach_file': -1.0,\n",
" 'hi_mark': -1.0,\n",
" 'kyoto_photo': -1.0,\n",
" 'mark_kyoto': -1.0,\n",
" 'photo_attach': -1.0})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can confirm that the updated weights can predict the instance $(x_2, y_2)$ correctly."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2) == y2"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-6.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterations #3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label $\\hat{y_1}$ for the instance $x_1$. This is an incorrect prediction ($\\hat{y_1} \\neq y_1 = +1$)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-3.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the model could not predict the instance $(x_1, y_1)$ correctly, update the feature weights."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"update_perceptron(w, x1, y1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the current feature weights $w$."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
" {'@bias': 0.0,\n",
" 'attach_file': 0.0,\n",
" 'darl_my': 1.0,\n",
" 'hi_darl': 1.0,\n",
" 'hi_mark': -1.0,\n",
" 'kyoto_photo': -1.0,\n",
" 'mark_kyoto': -1.0,\n",
" 'my_photo': 1.0,\n",
" 'photo_attach': 0.0})"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can confirm that the updated weights can predict the instance $(x_1, y_1)$ correctly."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1) == y1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterations #4\n",
"\n",
"We can confirm that the current feature weights can can predict the instance $(x_2, y_2)$ correctly."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2) == y2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterations #5\n",
"\n",
"We can confirm that the current feature weights can can predict the instance $(x_1, y_1)$ correctly."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1) == y1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us build a sentiment predictor (positive/negative) by using [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) distributed in [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preparing the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We download the dataset and prepare the training data."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2016-10-14 11:23:54-- http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\n",
"Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20\n",
"Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 487770 (476K) [application/x-gzip]\n",
"Saving to: ‘rt-polaritydata.tar.gz’\n",
"\n",
"100%[======================================>] 487,770 436KB/s in 1.1s \n",
"\n",
"2016-10-14 11:23:56 (436 KB/s) - ‘rt-polaritydata.tar.gz’ saved [487770/487770]\n",
"\n"
]
}
],
"source": [
"!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rt-polaritydata.README.1.0.txt\n",
"rt-polaritydata/rt-polarity.neg\n",
"rt-polaritydata/rt-polarity.pos\n"
]
}
],
"source": [
"!tar xvzf rt-polaritydata.tar.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the training instances in the tar-ball."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \r\n",
"the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . \r\n",
"effective but too-tepid biopic\r\n",
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \r\n",
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.pos"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"simplistic , silly and tedious . \r\n",
"it's so laddish and juvenile , only teenage boys could possibly find it funny . \r\n",
"exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \r\n",
"[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . \r\n",
"a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.neg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Merge positive and negative instances after inserting '+1' at the beginning of each line in the positive data and '-1' that in the negative data. Sort the order of the instances at random."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/+1 /g\" rt-polaritydata/rt-polarity.pos > positives.txt"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/-1 /g\" rt-polaritydata/rt-polarity.neg > negatives.txt"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sort -R positives.txt negatives.txt > data.txt"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-1 schmaltzy and unfunny , adam sandler's cartoon about hanukkah is numbingly bad , little nicky bad , 10 worst list bad . \r\n",
"-1 although there are several truly jolting scares , there's also an abundance of hackneyed dialogue and more silly satanic business than you can shake a severed limb at . \r\n",
"-1 fails to satisfactorily exploit its gender politics , genre thrills or inherent humor . \r\n",
"-1 it almost feels as if the movie is more interested in entertaining itself than in amusing us . \r\n",
"-1 it's the type of stunt the academy loves : a powerful political message stuffed into an otherwise mediocre film . \r\n",
"-1 broder's screenplay is shallow , offensive and redundant , with pitifully few real laughs . \r\n",
"+1 grant gets to display his cadness to perfection , but also to show acting range that may surprise some who thought light-hearted comedy was his forte . \r\n",
"+1 as ex-marine walter , who may or may not have shot kennedy , actor raymond j . barry is perfectly creepy and believable . \r\n",
"-1 it wouldn't be my preferred way of spending 100 minutes or $7 . 00 . \r\n",
"-1 the picture doesn't know it's a comedy . \r\n"
]
}
],
"source": [
"!head -n 10 data.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of positive and negative instances."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^+1' data.txt | wc -l"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^-1' data.txt | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementing a feature extractor\n",
"\n",
"We implement a feature extractor which converts a text into a sparse vector. We use a stop list distributed on the Web."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2016-10-14 11:23:58-- http://www.textfixer.com/resources/common-english-words.txt\n",
"Resolving www.textfixer.com (www.textfixer.com)... 216.172.105.85\n",
"Connecting to www.textfixer.com (www.textfixer.com)|216.172.105.85|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 551 [text/plain]\n",
"Saving to: ‘common-english-words.txt’\n",
"\n",
"100%[======================================>] 551 --.-K/s in 0s \n",
"\n",
"2016-10-14 11:23:58 (54.7 MB/s) - ‘common-english-words.txt’ saved [551/551]\n",
"\n"
]
}
],
"source": [
"!wget http://www.textfixer.com/resources/common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your"
]
}
],
"source": [
"!cat common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from stemming.porter2 import stem\n",
"\n",
"stoplist = set(open('common-english-words.txt').read().split(','))\n",
"\n",
"def is_non_stop(x):\n",
" return x not in stoplist\n",
"\n",
"def has_alnum(x):\n",
" return any((c.isalnum() for c in x))\n",
"\n",
"def feature(s):\n",
" \"\"\"Feature extractor (from a sequence of words).\n",
" \n",
" Args:\n",
" s: a list of words in a sentence.\n",
" Returns:\n",
" feature vector as a mapping object: feature -> value.\n",
" \n",
" \"\"\"\n",
" # Remove stop words (find words x \\in s where is_non_stop(x) is True)\n",
" x = filter(is_non_stop, s)\n",
" # Apply stemming (apply stem(i) for all i \\in x)\n",
" x = map(stem, x)\n",
" # Remove non alphanumeric words.\n",
" x = filter(has_alnum, x)\n",
" # Append the bias feature\n",
" x.append('@bias')\n",
" # Unigram features (the number of occurrences of each word)\n",
" return dict(collections.Counter(x))\n",
"\n",
"def T2F(text):\n",
" \"\"\"Feature extractor (from a natural sentence).\n",
" \n",
" Args:\n",
" text: a sentence.\n",
" Returns:\n",
" feature vector as a mapping object: feature -> value.\n",
" \n",
" \"\"\"\n",
" return feature(text.lower().split(' '))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the feature extractor."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1, 'silli': 1, 'simplist': 1, 'tedious': 1}"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F('simplistic , silly and tedious .')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1,\n",
" 'boy': 1,\n",
" 'find': 1,\n",
" 'funni': 1,\n",
" 'it': 1,\n",
" 'juvenil': 1,\n",
" 'laddish': 1,\n",
" 'possibl': 1,\n",
" 'teenag': 1}"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F(\"it's so laddish and juvenile , only teenage boys could possibly find it funny . \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the instances in `data.txt` and store each instance in an `Instance` object."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class Instance:\n",
" def __init__(self, x, y, text):\n",
" self.x = x\n",
" self.y = y\n",
" self.text = text\n",
" def __repr__(self):\n",
" return repr((self.y, self.x))\n",
"\n",
"D = []\n",
"for line in open('data.txt'):\n",
" pos = line.find(' ')\n",
" if pos == -1:\n",
" continue\n",
" y = int(line[:pos])\n",
" x = T2F(line[pos+1:])\n",
" D.append(Instance(x, y, line))"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(-1, {'humor': 1, 'satisfactorili': 1, 'gender': 1, 'genr': 1, 'exploit': 1, '@bias': 1, 'inher': 1, 'fail': 1, 'polit': 1, 'thrill': 1})"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with perceptron"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def training_with_perceptron(D, max_iterations=10):\n",
" \"\"\"Training a linear binary classifier with perceptron.\n",
" \n",
" Args:\n",
" D: training set, a list of Instance objects.\n",
" max_iterations: the number of iterations.\n",
" Returns:\n",
" w: weight vector (model) as a mapping object: feature -> weight.\n",
"\n",
" \"\"\"\n",
" w = collections.defaultdict(float)\n",
" for epoch in range(max_iterations):\n",
" random.shuffle(D) # This lazy implementation alters D.\n",
" for d in D:\n",
" update_perceptron(w, d.x, d.y)\n",
" return w"
]
},
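{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the training set is linearly separable, the perceptron eventually stops making mistakes, so we could stop before `max_iterations` epochs. As a minimal sketch (the helper `count_mistakes` is not part of the code above), re-using `predict`:\n",
"\n",
"```\n",
"def count_mistakes(w, D):\n",
"    return sum(1 for d in D if predict(w, d.x) != d.y)\n",
"\n",
"# e.g., at the end of each epoch in training_with_perceptron:\n",
"#     if count_mistakes(w, D) == 0:\n",
"#         break\n",
"```"
]
},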
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"w = training_with_perceptron(D)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-9.0"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
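{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the `(feature, weight)` pairs by weight in ascending order (`operator.itemgetter(1)` picks the weight out of each pair); the head of the sorted list then holds the most negative features and the tail the most positive ones."
]
},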
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.iteritems(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('snake', -11.0),\n",
" ('well-intent', -10.0),\n",
" ('generic', -10.0),\n",
" ('portion', -10.0),\n",
" (\"wasn't\", -10.0),\n",
" ('appar', -10.0),\n",
" ('unless', -10.0),\n",
" ('random', -10.0),\n",
" ('wast', -9.0),\n",
" ('hospit', -9.0)]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('precis', 9.0),\n",
" ('conscienc', 9.0),\n",
" ('air-condit', 9.0),\n",
" ('smith', 9.0),\n",
" ('spider-man', 10.0),\n",
" ('surreal', 10.0),\n",
" ('refresh', 10.0),\n",
" ('smarter', 11.0),\n",
" ('engross', 11.0),\n",
" ('explod', 11.0)]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Closed evaluation"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def predict_instance(w, d):\n",
" d.label = +1 if dot_product(w, d.x) > 0 else -1\n",
"\n",
"def predict_all_instances(w, D):\n",
" map(functools.partial(predict_instance, w), D)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predict_all_instances(w, D)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0].label"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def num_correct_predictions(D):\n",
" return sum(1 for d in D if d.y == d.label)\n",
"\n",
"def num_true_positives(D):\n",
" return sum(1 for d in D if d.y == 1 and d.y == d.label)\n",
"\n",
"def num_gold_positives(D):\n",
" return sum(1 for d in D if d.y == 1)\n",
"\n",
"def num_predicted_positives(D):\n",
" return sum(1 for d in D if d.label == 1)\n",
" \n",
"def compute_accuracy(D):\n",
" return num_correct_predictions(D) / float(len(D))\n",
"\n",
"def compute_precision(D):\n",
" return num_true_positives(D) / float(num_predicted_positives(D))\n",
"\n",
"def compute_recall(D):\n",
" return num_true_positives(D) / float(num_gold_positives(D))\n",
"\n",
"def compute_f1(D):\n",
" p = compute_precision(D)\n",
" r = compute_recall(D)\n",
" return 2 * p * r / (p + r) if 0 < p + r else 0."
]
},
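{
"cell_type": "markdown",
"metadata": {},
"source": [
"These functions follow the standard definitions. Writing $\\mathrm{TP}$ and $\\mathrm{TN}$ for the numbers of true positives and true negatives, $\\mathrm{PP}$ for the number of predicted positives, and $\\mathrm{GP}$ for the number of gold (actual) positives:\n",
"\n",
"$$\\mathrm{accuracy} = \\frac{\\mathrm{TP} + \\mathrm{TN}}{|D|}, \\qquad \\mathrm{precision} = \\frac{\\mathrm{TP}}{\\mathrm{PP}}, \\qquad \\mathrm{recall} = \\frac{\\mathrm{TP}}{\\mathrm{GP}}, \\qquad F_1 = \\frac{2 \\cdot \\mathrm{precision} \\cdot \\mathrm{recall}}{\\mathrm{precision} + \\mathrm{recall}}$$"
]
},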
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10197"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_correct_predictions(D)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10662"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(D)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9563871693866066"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9566441441441441"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9561057962858751"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_recall(D)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9563748944553898"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_f1(D)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross validation (open evaluation)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"N = 10\n",
"for n in range(N):\n",
" train_set = [D[i] for i in range(len(D)) if i % N != n]\n",
" test_set = [D[i] for i in range(len(D)) if i % N == n]\n",
" w = training_with_perceptron(train_set)\n",
" predict_all_instances(w, test_set)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7177827799662352"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(0.7309069212410502, 0.6893640967923467, 0.7095279467130032)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D), compute_recall(D), compute_f1(D)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the huge gap exists between the performances of the closed and open evaluations. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}