Jupyter notebook for classification.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, a feature vector $x$ is represented by a mapping (dictionary) object `x` whose keys are feature names and values are feature values. In other words, feature indices are represented by any strings rather than by integers. For example,\n", | |
"\n", | |
"```\n", | |
"x = {}\n", | |
"x['darling'] = 1\n", | |
"x['photo'] = 1\n", | |
"x['attach'] = 1\n", | |
"```\n", | |
"\n", | |
"This representation is useful because the feature space for a natural language is high dimensional and sparse. If we define a feature space as occurrences of every word, \n", | |
"\n", | |
"* the number of the dimension of the feature space ($m$) equals to the total number of words in the language, which typically amounts to 1M words.\n", | |
"* although a feature vector is represented by $m$-dimensional vector, most elements in the vector are zero; only a limited number of elements corresponding to the word in a sentence have non-zero values.\n", | |
"\n", | |
"A binary label $y$ is either `+1` (positive) or `-1` (negative)." | |
] | |
}, | |
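{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration (a sketch added for this writeup, not part of the lecture code), the cell below contrasts a dense list with one slot per vocabulary word against a dict that stores only the non-zero features; the vocabulary size and the word index `42` are made-up placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"m = 1000000                # hypothetical vocabulary of 1M words\n",
"dense = [0] * m            # dense representation: one slot per word\n",
"dense[42] = 1              # pretend 'darling' is word #42\n",
"\n",
"sparse = {'darling': 1, 'photo': 1, 'attach': 1}  # non-zero features only\n",
"\n",
"print sys.getsizeof(dense)   # on the order of megabytes\n",
"print sys.getsizeof(sparse)  # on the order of hundreds of bytes"
]
},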
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n",
"import functools\n",
"import math\n",
"import operator\n",
"import random"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the example used in the lecture, $[(x_1, y_1), (x_2, y_2)]$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hi darling, my photo in attached file\n",
"x1 = {'@bias': 1, 'hi_darl':1, 'darl_my':1, 'my_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y1 = +1\n",
"\n",
"# Hi Mark, Kyoto photo in attached file\n",
"x2 = {'@bias': 1, 'hi_mark':1, 'mark_kyoto':1, 'kyoto_photo':1, 'photo_attach':1, 'attach_file':1}\n",
"y2 = -1"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'@bias': 1,\n",
" 'attach_file': 1,\n",
" 'darl_my': 1,\n",
" 'hi_darl': 1,\n",
" 'my_photo': 1,\n",
" 'photo_attach': 1}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perceptron"
]
},
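{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the code below implements the standard perceptron rule: predict $\\hat{y} = \\mathrm{sign}(w \\cdot x)$ (a score of $0$ is treated as negative) and, whenever the prediction is wrong ($\\hat{y} \\, y < 0$), update\n",
"\n",
"$$w \\leftarrow w + y x.$$"
]
},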
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
"    \"\"\"Inner product, w \\\\cdot x.\n",
"    \n",
"    Args:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"        x: feature vector as a mapping object: feature -> value.\n",
"    Returns:\n",
"        the inner product, w \\\\cdot x.\n",
"\n",
"    \"\"\"\n",
"\n",
"    a = 0.\n",
"    for f, v in x.iteritems():\n",
"        a += w.get(f, 0.) * v\n",
"    return a\n",
"\n",
"def predict(w, x):\n",
"    \"\"\"Predict the label of an instance.\n",
"    \n",
"    Args:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"        x: feature vector as a mapping object: feature -> value.\n",
"    Returns:\n",
"        the predicted label: +1 (true) or -1 (false).\n",
"    \"\"\"\n",
"    return +1 if dot_product(w, x) > 0 else -1\n",
"\n",
"def update_perceptron(w, x, y):\n",
"    \"\"\"Update the model with a training instance (x, y).\n",
"    \n",
"    Args:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"        x: feature vector of the training instance as a mapping object.\n",
"        y: label of the training instance, -1 or +1.\n",
"\n",
"    \"\"\"\n",
"    yp = predict(w, x)\n",
"    if yp * y < 0:\n",
"        for f, v in x.iteritems():\n",
"            w[f] += y * v"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weight vector (model) is a dictionary object that automatically sets missing values to zero (`collections.defaultdict`). The initial model is empty (it has no features)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label for the instance $x_1$ (an incorrect prediction: it should be $+1$ because $y_1 = +1$)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted label was negative because the score for the instance is $0$, which is not greater than zero."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_1, y_1)$."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
"            {'@bias': 1.0,\n",
"             'attach_file': 1.0,\n",
"             'darl_my': 1.0,\n",
"             'hi_darl': 1.0,\n",
"             'my_photo': 1.0,\n",
"             'photo_attach': 1.0})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_perceptron(w, x1, y1)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict the label for the instance $x_2$ (an incorrect prediction: it should be $-1$ because $y_2 = -1$)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicted label was positive because the score for the instance is $3$."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_2, y_2)$."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
"            {'@bias': 0.0,\n",
"             'attach_file': 0.0,\n",
"             'darl_my': 1.0,\n",
"             'hi_darl': 1.0,\n",
"             'hi_mark': -1.0,\n",
"             'kyoto_photo': -1.0,\n",
"             'mark_kyoto': -1.0,\n",
"             'my_photo': 1.0,\n",
"             'photo_attach': 0.0})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_perceptron(w, x2, y2)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us predict labels for instances $x_1$ and $x_2$."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict(w, x2)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-3.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression"
]
},
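{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model below estimates the probability of the positive class with the sigmoid (logistic) function of the score $w \\cdot x$:\n",
"\n",
"$$P(+1|x) = \\sigma(w \\cdot x) = \\frac{1}{1 + e^{-w \\cdot x}}, \\qquad P(-1|x) = 1 - P(+1|x).$$"
]
},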
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def dot_product(w, x):\n",
"    \"\"\"Inner product, w \\\\cdot x.\n",
"    \n",
"    Args:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"        x: feature vector as a mapping object: feature -> value.\n",
"    Returns:\n",
"        the inner product, w \\\\cdot x.\n",
"\n",
"    \"\"\"\n",
"    a = 0.\n",
"    for f, v in x.iteritems():\n",
"        a += w.get(f, 0.) * v\n",
"    return a\n",
"\n",
"def probability(w, x):\n",
"    \"\"\"Compute P(+1|x).\n",
"    \n",
"    Args:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"        x: feature vector as a mapping object: feature -> value.\n",
"    Returns:\n",
"        the probability of the instance x being classified as positive.\n",
"\n",
"    \"\"\"\n",
"    a = dot_product(w, x)\n",
"    return 1. / (1 + math.exp(-a)) if -100. < a else 0.\n",
"\n",
"def update_logress(w, x, y, eta=1.0):\n",
"    \"\"\"Update the model with a training instance (x, y).\n",
"    \n",
"    Args:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"        x: feature vector of the training instance as a mapping object.\n",
"        y: label of the training instance, -1 or +1.\n",
"        eta: the learning rate for updating the model (default: 1.0).\n",
"\n",
"    \"\"\"\n",
"\n",
"    # Update the model (feature weights) with a training instance (x, y)\n",
"    y = (y + 1) / 2 # convert {-1,1} to {0,1}\n",
"    p = probability(w, x)\n",
"    g = y - p\n",
"    for f, v in x.iteritems():\n",
"        w[f] += eta * g * v"
]
},
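{
"cell_type": "markdown",
"metadata": {},
"source": [
"The update in `update_logress` is stochastic gradient descent on the negative log-likelihood (a standard derivation, spelled out here for reference). Writing $t = (y+1)/2 \\in \\{0, 1\\}$ and $p = P(+1|x)$, the per-instance loss is\n",
"\n",
"$$\\ell(w) = -t \\log p - (1-t) \\log (1-p),$$\n",
"\n",
"whose gradient with respect to $w_f$ is $(p - t)\\, x_f$. Stepping against the gradient gives $w_f \\leftarrow w_f + \\eta (t - p) x_f$, which is exactly the `w[f] += eta * g * v` line."
]
},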
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weight vector (model) is a dictionary object that automatically sets missing values to zero (`collections.defaultdict`). The initial model is empty (it has no features)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float, {})"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = collections.defaultdict(float)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute $P(+1|x_1)$ on the initial model."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.5"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probability ($0.5$) means that the model has no information for classifying the instance $x_1$ (because the model is empty)."
]
},
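{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed, with an empty model $w \\cdot x_1 = 0$, so\n",
"\n",
"$$P(+1|x_1) = \\frac{1}{1 + e^{0}} = 0.5.$$"
]
},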
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_1, y_1)$."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
"            {'@bias': 0.5,\n",
"             'attach_file': 0.5,\n",
"             'darl_my': 0.5,\n",
"             'hi_darl': 0.5,\n",
"             'my_photo': 0.5,\n",
"             'photo_attach': 0.5})"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_logress(w, x1, y1)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weights for the features in the instance $x_1$ are set to $0.5$ based on the amount of the error, $(y - p) = (1 - 0.5) = 0.5$."
]
},
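{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, every feature $f$ of $x_1$ has value $1$, so with the default learning rate $\\eta = 1.0$ the update rule gives\n",
"\n",
"$$w_f \\leftarrow w_f + \\eta (y - p) x_f = 0 + 1.0 \\times 0.5 \\times 1 = 0.5.$$"
]
},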
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute $P(+1|x_2)$ on the updated model."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.8175744761936437"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ideally, the probability $P(+1|x_2)$ should be close to zero (in other words, $P(-1|x_2) = 1 - P(+1|x_2)$ should be close to one) because $y_2 = -1$; the model overestimates it because it has seen only the positive instance so far."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the model with the instance $(x_2, y_2)$."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(float,\n",
"            {'@bias': -0.31757447619364365,\n",
"             'attach_file': -0.31757447619364365,\n",
"             'darl_my': 0.5,\n",
"             'hi_darl': 0.5,\n",
"             'hi_mark': -0.8175744761936437,\n",
"             'kyoto_photo': -0.8175744761936437,\n",
"             'mark_kyoto': -0.8175744761936437,\n",
"             'my_photo': 0.5,\n",
"             'photo_attach': -0.31757447619364365})"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"update_logress(w, x2, y2)\n",
"w"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The amount of the error for the instance $x_2$ is $(y - p) = (0 - 0.81757...) = -0.81757...$. We can interpret the feature weights as follows:\n",
"\n",
"* 0.5: the feature appears only in $x_1$\n",
"* -0.8...: the feature appears only in $x_2$\n",
"* -0.3...: the feature appears in both $x_1$ and $x_2$"
]
},
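{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a feature shared by both instances receives $+0.5$ from the first update and $1.0 \\times (-0.81757...) \\times 1$ from the second, ending at $0.5 - 0.81757... = -0.31757...$, which matches the weights shown above."
]
},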
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us compute $P(+1|x_1)$ and $P(+1|x_2)$ on the updated model."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.6335035042481402"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x1)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.032125669946444585"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, x2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both look reasonable, but the classifier leans negative because it incurred a larger error on $x_2$ than on $x_1$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sentiment analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us build a sentiment predictor (positive/negative) by using [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) distributed in [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preparing the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, download the dataset and extract files in the tar-ball (*.tar.gz)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2015-11-20 13:35:13-- http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\n",
"Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 128.84.154.137\n",
"Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|128.84.154.137|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 487770 (476K) [application/x-gzip]\n",
"Saving to: ‘rt-polaritydata.tar.gz’\n",
"\n",
"100%[======================================>] 487,770 413KB/s in 1.2s \n",
"\n",
"2015-11-20 13:35:14 (413 KB/s) - ‘rt-polaritydata.tar.gz’ saved [487770/487770]\n",
"\n"
]
}
],
"source": [
"!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rt-polaritydata.README.1.0.txt\n",
"rt-polaritydata/rt-polarity.neg\n",
"rt-polaritydata/rt-polarity.pos\n"
]
}
],
"source": [
"!tar xvzf rt-polaritydata.tar.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the training instances in the tar-ball."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \r\n",
"the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . \r\n",
"effective but too-tepid biopic\r\n",
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \r\n",
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.pos"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"simplistic , silly and tedious . \r\n",
"it's so laddish and juvenile , only teenage boys could possibly find it funny . \r\n",
"exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \r\n",
"[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . \r\n",
"a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . \r\n"
]
}
],
"source": [
"!head -n5 rt-polaritydata/rt-polarity.neg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Merge the positive and negative instances after inserting '+1' at the beginning of each line in the positive data and '-1' at the beginning of each line in the negative data. Then shuffle the instances into a random order."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/+1 /g\" rt-polaritydata/rt-polarity.pos > positives.txt"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sed \"s/^/-1 /g\" rt-polaritydata/rt-polarity.neg > negatives.txt"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!sort -R positives.txt negatives.txt > data.txt"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+1 a pleasant enough comedy that should have found a summer place . \r\n",
"-1 the only thing in pauline and paulette that you haven't seen before is a scene featuring a football field-sized oriental rug crafted out of millions of vibrant flowers . \r\n",
"-1 the problematic characters and overly convenient plot twists foul up shum's good intentions . \r\n",
"-1 it will probably prove interesting to ram dass fans , but to others it may feel like a parody of the mellow , peace-and-love side of the '60s counterculture . \r\n",
"-1 if all of eight legged freaks was as entertaining as the final hour , i would have no problem giving it an unqualified recommendation . \r\n",
"+1 sweetly sexy , funny and touching . \r\n",
"-1 the film seems all but destined to pop up on a television screen in the background of a scene in a future quentin tarantino picture\r\n",
"+1 while not all that bad of a movie , it's nowhere near as good as the original . \r\n",
"+1 it remains to be seen whether statham can move beyond the crime-land action genre , but then again , who says he has to ? \r\n",
"-1 tom green and an ivy league college should never appear together on a marquee , especially when the payoff is an unschooled comedy like stealing harvard , which fails to keep 80 minutes from seeming like 800 . \r\n"
]
}
],
"source": [
"!head -n 10 data.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of positive and negative instances."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^+1' data.txt | wc -l"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5331\r\n"
]
}
],
"source": [
"!grep '^-1' data.txt | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementing a feature extractor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use a stop list distributed on the Web."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2015-11-20 13:35:17-- http://www.textfixer.com/resources/common-english-words.txt\n",
"Resolving www.textfixer.com (www.textfixer.com)... 216.172.104.5\n",
"Connecting to www.textfixer.com (www.textfixer.com)|216.172.104.5|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 551 [text/plain]\n",
"Saving to: ‘common-english-words.txt’\n",
"\n",
"100%[======================================>] 551 --.-K/s in 0s \n",
"\n",
"2015-11-20 13:35:17 (67.9 MB/s) - ‘common-english-words.txt’ saved [551/551]\n",
"\n"
]
}
],
"source": [
"!wget http://www.textfixer.com/resources/common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your"
]
}
],
"source": [
"!cat common-english-words.txt"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from stemming.porter2 import stem\n",
"\n",
"stoplist = set(open('common-english-words.txt').read().split(','))\n",
"\n",
"def is_non_stop(x):\n",
"    return x not in stoplist\n",
"\n",
"def has_alnum(x):\n",
"    return any((c.isalnum() for c in x))\n",
"\n",
"def feature(s):\n",
"    \"\"\"Feature extractor (from a sequence of words).\n",
"    \n",
"    Args:\n",
"        s: a list of words in a sentence.\n",
"    Returns:\n",
"        feature vector as a mapping object: feature -> value.\n",
"    \n",
"    \"\"\"\n",
"    # Remove stop words (find words x \\\\in s where is_non_stop(x) is True)\n",
"    x = filter(is_non_stop, s)\n",
"    # Apply stemming (apply stem(i) for all i \\\\in x)\n",
"    x = map(stem, x)\n",
"    # Remove non alphanumeric words.\n",
"    x = filter(has_alnum, x)\n",
"    # Append the bias feature\n",
"    x.append('@bias')\n",
"    # Unigram features (the number of occurrences of each word)\n",
"    return collections.Counter(x)\n",
"\n",
"def T2F(text):\n",
"    \"\"\"Feature extractor (from a natural sentence).\n",
"    \n",
"    Args:\n",
"        text: a sentence.\n",
"    Returns:\n",
"        feature vector as a mapping object: feature -> value.\n",
"    \n",
"    \"\"\"\n",
"    return feature(text.lower().split(' '))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check the feature extractor."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'@bias': 1, 'silli': 1, 'simplist': 1, 'tedious': 1})"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F('simplistic , silly and tedious .')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'@bias': 1,\n",
"         'boy': 1,\n",
"         'find': 1,\n",
"         'funni': 1,\n",
"         'it': 1,\n",
"         'juvenil': 1,\n",
"         'laddish': 1,\n",
"         'possibl': 1,\n",
"         'teenag': 1})"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T2F(\"it's so laddish and juvenile , only teenage boys could possibly find it funny . \")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the instances in `data.txt` and store each instance in an `Instance` object."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class Instance:\n",
"    def __init__(self, x, y, text):\n",
"        self.x = x\n",
"        self.y = y\n",
"        self.text = text\n",
"    def __repr__(self):\n",
"        return repr((self.y, self.x))\n",
"\n",
"D = []\n",
"for line in open('data.txt'):\n",
"    pos = line.find(' ')\n",
"    if pos == -1:\n",
"        continue\n",
"    y = int(line[:pos])\n",
"    x = T2F(line[pos+1:])\n",
"    D.append(Instance(x, y, line))"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1, Counter({'summer': 1, 'comedi': 1, '@bias': 1, 'pleasant': 1, 'enough': 1, 'place': 1, 'found': 1}))"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with perceptron"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def training_with_perceptron(D, max_iterations=10):\n",
"    \"\"\"Training a linear binary classifier with perceptron.\n",
"    \n",
"    Args:\n",
"        D: training set, a list of Instance objects.\n",
"        max_iterations: the number of iterations.\n",
"    Returns:\n",
"        w: weight vector (model) as a mapping object: feature -> weight.\n",
"\n",
"    \"\"\"\n",
"    w = collections.defaultdict(float)\n",
"    for epoch in range(max_iterations):\n",
"        random.shuffle(D) # This lazy implementation alters D.\n",
"        for d in D:\n",
"            update_perceptron(w, d.x, d.y)\n",
"    return w"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"w = training_with_perceptron(D)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-8.0"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.0"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_product(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.iteritems(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('appar', -12.0),\n",
" ('snake', -11.0),\n",
" ('well-intent', -11.0),\n",
" ('unless', -11.0),\n",
" ('schneider', -10.0),\n",
" ('prettiest', -10.0),\n",
" ('demm', -10.0),\n",
" ('incoher', -10.0),\n",
" ('ballist', -10.0),\n",
" ('purport', -9.0)]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('explod', 9.0),\n",
" ('glorious', 9.0),\n",
" ('frailti', 9.0),\n",
" ('resist', 9.0),\n",
" ('smith', 9.0),\n",
" ('confid', 10.0),\n",
" ('tape', 10.0),\n",
" ('optimist', 10.0),\n",
" ('refresh', 11.0),\n",
" ('engross', 13.0)]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with logistic regression"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def training_with_logistic_regression(D, max_iterations=10, eta0=0.25):\n", | |
" \"\"\"Training a linear binary classifier with logistic regression.\n", | |
" \n", | |
" Args:\n", | |
" D: training set, a list of Instance objects.\n", | |
" max_iterations: the number of iterations.\n", | |
" eta0: the initial learning rate.\n", | |
" Returns:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
"\n", | |
" \"\"\"\n", | |
" t = 0\n", | |
" T = len(D) * max_iterations\n", | |
" w = collections.defaultdict(float)\n", | |
" for epoch in range(max_iterations):\n", | |
" random.shuffle(D) # This lazy implementation alters D.\n", | |
" for d in D:\n", | |
" eta = eta0 * (1 - t / (T+1))\n", | |
" update_logress(w, d.x, d.y, eta)\n", | |
" return w" | |
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"w = training_with_logistic_regression(D)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.0049367426713377415"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, T2F('simplistic , silly and tedious .'))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7515928142237298"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probability(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"M = sorted(w.iteritems(), key=operator.itemgetter(1))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('bore', -3.8791828412417098),\n",
" ('unless', -3.7829369145348104),\n",
" ('appar', -3.758309568480576),\n",
" ('wast', -3.6960486949296003),\n",
" ('snake', -3.6184562166109298),\n",
" ('mediocr', -3.553742416416723),\n",
" ('routin', -3.4720750656289727),\n",
" (\"wasn't\", -3.42828223524663),\n",
" ('incoher', -3.393891755683168),\n",
" ('generic', -3.387698473383638)]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[:10]"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('optimist', 3.1985022632691846),\n",
" ('lane', 3.2043386515334706),\n",
" ('examin', 3.2555649003689764),\n",
" ('confid', 3.3621029766188024),\n",
" ('resist', 3.399846251977437),\n",
" ('unexpect', 3.468014876606892),\n",
" ('smarter', 3.7142599381159553),\n",
" ('glorious', 3.8256576447392465),\n",
" ('refresh', 4.131514135332804),\n",
" ('engross', 4.666381021792194)]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Closed evaluation"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def predict(w, d):\n",
"    d.label = +1 if dot_product(w, d.x) > 0 else -1\n",
"\n",
"def predict_all(w, D):\n",
"    map(functools.partial(predict, w), D)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predict_all(w, D)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D[0].label"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def num_correct_predictions(D):\n",
"    return sum(1 for d in D if d.y == d.label)\n",
"\n",
"def num_true_positives(D):\n",
"    return sum(1 for d in D if d.y == 1 and d.y == d.label)\n",
"\n",
"def num_gold_positives(D):\n",
"    return sum(1 for d in D if d.y == 1)\n",
"\n",
"def num_predicted_positives(D):\n",
"    return sum(1 for d in D if d.label == 1)\n",
"    \n",
"def compute_accuracy(D):\n",
"    return num_correct_predictions(D) / float(len(D))\n",
"\n",
"def compute_precision(D):\n",
"    return num_true_positives(D) / float(num_predicted_positives(D))\n",
"\n",
"def compute_recall(D):\n",
"    return num_true_positives(D) / float(num_gold_positives(D))\n",
"\n",
"def compute_f1(D):\n",
"    p = compute_precision(D)\n",
"    r = compute_recall(D)\n",
"    return 2 * p * r / (p + r) if 0 < p + r else 0."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10344"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_correct_predictions(D)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10662"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(D)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9701744513224536"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9881207400194741"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9517914087413243"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_recall(D)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.9696158991018535"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_f1(D)"
]
},
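{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, plugging the precision and recall above into the F1 formula reproduces this value:\n",
"\n",
"$$F_1 = \\frac{2 \\times 0.98812 \\times 0.95179}{0.98812 + 0.95179} \\approx 0.96962.$$"
]
},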
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cross validation (open evaluation)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"N = 10\n",
"for n in range(N):\n",
"    train_set = [D[i] for i in range(len(D)) if i % N != n]\n",
"    test_set = [D[i] for i in range(len(D)) if i % N == n]\n",
"    w = training_with_logistic_regression(train_set)\n",
"    predict_all(w, test_set)"
]
},
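{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above fills in `d.label` for every instance exactly once (each instance falls into the test fold exactly once), so evaluating over the whole of `D` afterwards gives an open (held-out) estimate. Below is a minimal sketch for inspecting per-fold accuracies, assuming the same fold assignment `i % N`; this cell is an addition, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for n in range(N):\n",
"    fold = [D[i] for i in range(len(D)) if i % N == n]\n",
"    print n, compute_accuracy(fold)"
]
},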
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7481710748452448"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_accuracy(D)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(0.7451816160118606, 0.7542674920277621, 0.7496970261955812)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compute_precision(D), compute_recall(D), compute_f1(D)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |