{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Binary classification with perceptron" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This notebook is available at: https://gist.github.com/chokkan/a962dbf16f070bd0df5d182040a39f3b" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Toy example\n", | |
"\n", | |
"In this notebook, a feature vector $\\phi(x)$ of an instance $x$ is represented by a mapping (dictionary) object `x` whose keys are feature names and values are feature values. In other words, feature indices are represented by any strings rather than by integers. For example,\n", | |
"\n", | |
"```\n", | |
"x = {}\n", | |
"x['darling'] = 1\n", | |
"x['photo'] = 1\n", | |
"x['attach'] = 1\n", | |
"```\n", | |
"\n", | |
"This representation is useful because the feature space for a natural language is high dimensional and sparse. If we define a feature space as occurrences of every word, \n", | |
"\n", | |
"* the number of the dimension of the feature space ($d$) equals to the total number of words in the language, which typically amounts to 1M words.\n", | |
"* although a feature vector is represented by $d$-dimensional vector, most elements in the vector are zero; only a limited number of elements corresponding to the word in a sentence have non-zero values.\n", | |
"\n", | |
"A binary label $y$ is either `+1` (positive) or `-1` (negative)." | |
] | |
}, | |
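{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a quick illustration of the sparsity (a hypothetical snippet, not part of the lecture example), the mapping object stores only the non-zero coordinates of the $d$-dimensional feature vector:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Hypothetical illustration: only non-zero features are stored explicitly.\n", | |
"x = {'darling': 1, 'photo': 1, 'attach': 1}\n", | |
"d = 1000000   # assumed vocabulary size (~1M words)\n", | |
"len(x), d     # 3 entries stored instead of 1,000,000" | |
] | |
}, | |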
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import collections\n", | |
"import functools\n", | |
"import math\n", | |
"import operator\n", | |
"import random" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is the example used in the lecture, $[(x_1, y_1), (x_2, y_2)]$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Hi darling, my photo in attached file\n", | |
"x1 = {'@bias': 1, 'hi_darl':1, 'darl_my':1, 'my_photo':1, 'photo_attach':1, 'attach_file':1}\n", | |
"y1 = +1\n", | |
"\n", | |
"# Hi Mark, Kyoto photo in attached file\n", | |
"x2 = {'@bias': 1, 'hi_mark':1, 'mark_kyoto':1, 'kyoto_photo':1, 'photo_attach':1, 'attach_file':1}\n", | |
"y2 = -1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1,\n", | |
" 'attach_file': 1,\n", | |
" 'darl_my': 1,\n", | |
" 'hi_darl': 1,\n", | |
" 'my_photo': 1,\n", | |
" 'photo_attach': 1}" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1,\n", | |
" 'attach_file': 1,\n", | |
" 'hi_mark': 1,\n", | |
" 'kyoto_photo': 1,\n", | |
" 'mark_kyoto': 1,\n", | |
" 'photo_attach': 1}" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Perceptron" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is an implementation of the perceptron algorithm." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def dot_product(w, x):\n", | |
" \"\"\"Inner product, w \\cdot x.\n", | |
" \n", | |
" Args:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
" x: feature vector as a mapping object: feature -> value.\n", | |
" Returns:\n", | |
" the inner product, w \\cdot x.\n", | |
"\n", | |
" \"\"\"\n", | |
"\n", | |
" a = 0.\n", | |
" for f, v in x.iteritems():\n", | |
" a += w.get(f, 0.) * v\n", | |
" return a\n", | |
"\n", | |
"def predict(w, x):\n", | |
" \"\"\"Predict the label of an instance.\n", | |
" \n", | |
" Args:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
" x: feature vector as a mapping object: feature -> value.\n", | |
" Returns:\n", | |
" the predicted label: +1 (true) or -1 (false).\n", | |
" \"\"\"\n", | |
" return +1 if dot_product(w, x) >= 0 else -1 \n", | |
"\n", | |
"def update_perceptron(w, x, y):\n", | |
" \"\"\"Update the model with a training instance (x, y).\n", | |
" \n", | |
" Args:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
" x: feature vector of the training instance as a mapping object.\n", | |
" y: label of the training instance, -1 or +1.\n", | |
"\n", | |
" \"\"\"\n", | |
" yp = predict(w, x)\n", | |
" if yp * y < 0:\n", | |
" for f, v in x.iteritems():\n", | |
" w[f] += y * v" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Initialization\n", | |
"We represent the weight vector (model) with a special dictionary object that automatically sets missing values to zero (`collections.defaultdict`). The initial model is empty (no feature weight)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"defaultdict(float, {})" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"w = collections.defaultdict(float)\n", | |
"w" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iteration #1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Predict the label $\\hat{y_1}$ for the instance $x_1$. This is a correct prediction ($\\hat{y_1} = y_1 = +1$) because we assume an inner product of zero as a positive label." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iteration #2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Then, predict the label $\\hat{y_2}$ for the instance $x_2$. This is an incorrect prediction ($\\hat{y_2} \\neq y_2 = -1$)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.0" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Because the model could not predict the instance $(x_2, y_2)$ correctly, update the feature weights." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"update_perceptron(w, x2, y2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check the current feature weights $w$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"defaultdict(float,\n", | |
" {'@bias': -1.0,\n", | |
" 'attach_file': -1.0,\n", | |
" 'hi_mark': -1.0,\n", | |
" 'kyoto_photo': -1.0,\n", | |
" 'mark_kyoto': -1.0,\n", | |
" 'photo_attach': -1.0})" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"w" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can confirm that the updated weights can predict the instance $(x_2, y_2)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x2) == y2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-6.0" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iterations #3" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Predict the label $\\hat{y_1}$ for the instance $x_1$. This is an incorrect prediction ($\\hat{y_1} \\neq y_1 = +1$)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-1" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-3.0" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Because the model could not predict the instance $(x_1, y_1)$ correctly, update the feature weights." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"update_perceptron(w, x1, y1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Check the current feature weights $w$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"defaultdict(float,\n", | |
" {'@bias': 0.0,\n", | |
" 'attach_file': 0.0,\n", | |
" 'darl_my': 1.0,\n", | |
" 'hi_darl': 1.0,\n", | |
" 'hi_mark': -1.0,\n", | |
" 'kyoto_photo': -1.0,\n", | |
" 'mark_kyoto': -1.0,\n", | |
" 'my_photo': 1.0,\n", | |
" 'photo_attach': 0.0})" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"w" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can confirm that the updated weights can predict the instance $(x_1, y_1)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1) == y1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"3.0" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, x1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iterations #4\n", | |
"\n", | |
"We can confirm that the current feature weights can can predict the instance $(x_2, y_2)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x2) == y2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Iterations #5\n", | |
"\n", | |
"We can confirm that the current feature weights can can predict the instance $(x_1, y_1)$ correctly." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"predict(w, x1) == y1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Sentiment analysis" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us build a sentiment predictor (positive/negative) by using [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) distributed in [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Preparing the data set" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We download the dataset and prepare the training data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2016-10-14 11:23:54-- http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz\n", | |
"Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20\n", | |
"Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 487770 (476K) [application/x-gzip]\n", | |
"Saving to: ‘rt-polaritydata.tar.gz’\n", | |
"\n", | |
"100%[======================================>] 487,770 436KB/s in 1.1s \n", | |
"\n", | |
"2016-10-14 11:23:56 (436 KB/s) - ‘rt-polaritydata.tar.gz’ saved [487770/487770]\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"rt-polaritydata.README.1.0.txt\n", | |
"rt-polaritydata/rt-polarity.neg\n", | |
"rt-polaritydata/rt-polarity.pos\n" | |
] | |
} | |
], | |
"source": [ | |
"!tar xvzf rt-polaritydata.tar.gz" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us check the training instances in the tar-ball." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \r\n", | |
"the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . \r\n", | |
"effective but too-tepid biopic\r\n", | |
"if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \r\n", | |
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \r\n" | |
] | |
} | |
], | |
"source": [ | |
"!head -n5 rt-polaritydata/rt-polarity.pos" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"simplistic , silly and tedious . \r\n", | |
"it's so laddish and juvenile , only teenage boys could possibly find it funny . \r\n", | |
"exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \r\n", | |
"[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . \r\n", | |
"a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . \r\n" | |
] | |
} | |
], | |
"source": [ | |
"!head -n5 rt-polaritydata/rt-polarity.neg" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Merge positive and negative instances after inserting '+1' at the beginning of each line in the positive data and '-1' that in the negative data. Sort the order of the instances at random." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!sed \"s/^/+1 /g\" rt-polaritydata/rt-polarity.pos > positives.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!sed \"s/^/-1 /g\" rt-polaritydata/rt-polarity.neg > negatives.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"!sort -R positives.txt negatives.txt > data.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"-1 schmaltzy and unfunny , adam sandler's cartoon about hanukkah is numbingly bad , little nicky bad , 10 worst list bad . \r\n", | |
"-1 although there are several truly jolting scares , there's also an abundance of hackneyed dialogue and more silly satanic business than you can shake a severed limb at . \r\n", | |
"-1 fails to satisfactorily exploit its gender politics , genre thrills or inherent humor . \r\n", | |
"-1 it almost feels as if the movie is more interested in entertaining itself than in amusing us . \r\n", | |
"-1 it's the type of stunt the academy loves : a powerful political message stuffed into an otherwise mediocre film . \r\n", | |
"-1 broder's screenplay is shallow , offensive and redundant , with pitifully few real laughs . \r\n", | |
"+1 grant gets to display his cadness to perfection , but also to show acting range that may surprise some who thought light-hearted comedy was his forte . \r\n", | |
"+1 as ex-marine walter , who may or may not have shot kennedy , actor raymond j . barry is perfectly creepy and believable . \r\n", | |
"-1 it wouldn't be my preferred way of spending 100 minutes or $7 . 00 . \r\n", | |
"-1 the picture doesn't know it's a comedy . \r\n" | |
] | |
} | |
], | |
"source": [ | |
"!head -n 10 data.txt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Count the number of positive and negative instances." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"5331\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!grep '^+1' data.txt | wc -l" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"5331\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!grep '^-1' data.txt | wc -l" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Implementing a feature extractor\n", | |
"\n", | |
"We implement a feature extractor which converts a text into a sparse vector. We use a stop list distributed on the Web." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"--2016-10-14 11:23:58-- http://www.textfixer.com/resources/common-english-words.txt\n", | |
"Resolving www.textfixer.com (www.textfixer.com)... 216.172.105.85\n", | |
"Connecting to www.textfixer.com (www.textfixer.com)|216.172.105.85|:80... connected.\n", | |
"HTTP request sent, awaiting response... 200 OK\n", | |
"Length: 551 [text/plain]\n", | |
"Saving to: ‘common-english-words.txt’\n", | |
"\n", | |
"100%[======================================>] 551 --.-K/s in 0s \n", | |
"\n", | |
"2016-10-14 11:23:58 (54.7 MB/s) - ‘common-english-words.txt’ saved [551/551]\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"!wget http://www.textfixer.com/resources/common-english-words.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your" | |
] | |
} | |
], | |
"source": [ | |
"!cat common-english-words.txt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from stemming.porter2 import stem\n", | |
"\n", | |
"stoplist = set(open('common-english-words.txt').read().split(','))\n", | |
"\n", | |
"def is_non_stop(x):\n", | |
" return x not in stoplist\n", | |
"\n", | |
"def has_alnum(x):\n", | |
" return any((c.isalnum() for c in x))\n", | |
"\n", | |
"def feature(s):\n", | |
" \"\"\"Feature extractor (from a sequence of words).\n", | |
" \n", | |
" Args:\n", | |
" s: a list of words in a sentence.\n", | |
" Returns:\n", | |
" feature vector as a mapping object: feature -> value.\n", | |
" \n", | |
" \"\"\"\n", | |
" # Remove stop words (find words x \\in s where is_non_stop(x) is True)\n", | |
" x = filter(is_non_stop, s)\n", | |
" # Apply stemming (apply stem(i) for all i \\in x)\n", | |
" x = map(stem, x)\n", | |
" # Remove non alphanumeric words.\n", | |
" x = filter(has_alnum, x)\n", | |
" # Append the bias feature\n", | |
" x.append('@bias')\n", | |
" # Unigram features (the number of occurrences of each word)\n", | |
" return dict(collections.Counter(x))\n", | |
"\n", | |
"def T2F(text):\n", | |
" \"\"\"Feature extractor (from a natural sentence).\n", | |
" \n", | |
" Args:\n", | |
" text: a sentence.\n", | |
" Returns:\n", | |
" feature vector as a mapping object: feature -> value.\n", | |
" \n", | |
" \"\"\"\n", | |
" return feature(text.lower().split(' '))" | |
] | |
}, | |
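{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To see what each stage of `feature` does, here is an illustrative breakdown on one sentence (a hypothetical cell, not in the original notebook); recall that `filter` and `map` return lists in Python 2:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Hypothetical walkthrough of the pipeline inside feature().\n", | |
"s = 'simplistic , silly and tedious .'.split(' ')\n", | |
"s = filter(is_non_stop, s)   # removes the stop word 'and'\n", | |
"s = map(stem, s)             # applies the Porter2 stemmer\n", | |
"s = filter(has_alnum, s)     # drops punctuation-only tokens\n", | |
"s                            # ['simplist', 'silli', 'tedious']" | |
] | |
}, | |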
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us check the feature extractor." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1, 'silli': 1, 'simplist': 1, 'tedious': 1}" | |
] | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"T2F('simplistic , silly and tedious .')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'@bias': 1,\n", | |
" 'boy': 1,\n", | |
" 'find': 1,\n", | |
" 'funni': 1,\n", | |
" 'it': 1,\n", | |
" 'juvenil': 1,\n", | |
" 'laddish': 1,\n", | |
" 'possibl': 1,\n", | |
" 'teenag': 1}" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"T2F(\"it's so laddish and juvenile , only teenage boys could possibly find it funny . \")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Load the data set" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Read the instances in `data.txt` and store each instance in an `Instance` object." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"class Instance:\n", | |
" def __init__(self, x, y, text):\n", | |
" self.x = x\n", | |
" self.y = y\n", | |
" self.text = text\n", | |
" def __repr__(self):\n", | |
" return repr((self.y, self.x))\n", | |
"\n", | |
"D = []\n", | |
"for line in open('data.txt'):\n", | |
" pos = line.find(' ')\n", | |
" if pos == -1:\n", | |
" continue\n", | |
" y = int(line[:pos])\n", | |
" x = T2F(line[pos+1:])\n", | |
" D.append(Instance(x, y, line))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(-1, {'humor': 1, 'satisfactorili': 1, 'gender': 1, 'genr': 1, 'exploit': 1, '@bias': 1, 'inher': 1, 'fail': 1, 'polit': 1, 'thrill': 1})" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"D[2]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Training with perceptron" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def training_with_perceptron(D, max_iterations=10):\n", | |
" \"\"\"Training a linear binary classifier with perceptron.\n", | |
" \n", | |
" Args:\n", | |
" D: training set, a list of Instance objects.\n", | |
" max_iterations: the number of iterations.\n", | |
" Returns:\n", | |
" w: weight vector (model) as a mapping object: feature -> weight.\n", | |
"\n", | |
" \"\"\"\n", | |
" w = collections.defaultdict(float)\n", | |
" for epoch in range(max_iterations):\n", | |
" random.shuffle(D) # This lazy implementation alters D.\n", | |
" for d in D:\n", | |
" update_perceptron(w, d.x, d.y)\n", | |
" return w" | |
] | |
}, | |
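{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The implementation above shuffles `D` in place. If mutating the training set is undesirable, one can shuffle a list of indices instead; a minimal sketch of this variant (not in the original notebook):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def training_with_perceptron_pure(D, max_iterations=10):\n", | |
"    \"\"\"Variant that shuffles an index list, leaving D untouched.\"\"\"\n", | |
"    w = collections.defaultdict(float)\n", | |
"    order = list(range(len(D)))\n", | |
"    for epoch in range(max_iterations):\n", | |
"        random.shuffle(order)   # shuffle indices instead of D itself\n", | |
"        for i in order:\n", | |
"            update_perceptron(w, D[i].x, D[i].y)\n", | |
"    return w" | |
] | |
}, | |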
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"w = training_with_perceptron(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-9.0" | |
] | |
}, | |
"execution_count": 41, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, T2F('simplistic , silly and tedious .'))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2.0" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"dot_product(w, T2F('guaranteed to move anyone who ever shook , rattled , or rolled .'))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"M = sorted(w.iteritems(), key=operator.itemgetter(1))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('snake', -11.0),\n", | |
" ('well-intent', -10.0),\n", | |
" ('generic', -10.0),\n", | |
" ('portion', -10.0),\n", | |
" (\"wasn't\", -10.0),\n", | |
" ('appar', -10.0),\n", | |
" ('unless', -10.0),\n", | |
" ('random', -10.0),\n", | |
" ('wast', -9.0),\n", | |
" ('hospit', -9.0)]" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"M[:10]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[('precis', 9.0),\n", | |
" ('conscienc', 9.0),\n", | |
" ('air-condit', 9.0),\n", | |
" ('smith', 9.0),\n", | |
" ('spider-man', 10.0),\n", | |
" ('surreal', 10.0),\n", | |
" ('refresh', 10.0),\n", | |
" ('smarter', 11.0),\n", | |
" ('engross', 11.0),\n", | |
" ('explod', 11.0)]" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"M[-10:]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Closed evaluation" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def predict_instance(w, d):\n", | |
" d.label = +1 if dot_product(w, d.x) > 0 else -1\n", | |
"\n", | |
"def predict_all_instances(w, D):\n", | |
" map(functools.partial(predict_instance, w), D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"predict_all_instances(w, D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"-1" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"D[0].label" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def num_correct_predictions(D):\n", | |
" return sum(1 for d in D if d.y == d.label)\n", | |
"\n", | |
"def num_true_positives(D):\n", | |
" return sum(1 for d in D if d.y == 1 and d.y == d.label)\n", | |
"\n", | |
"def num_gold_positives(D):\n", | |
" return sum(1 for d in D if d.y == 1)\n", | |
"\n", | |
"def num_predicted_positives(D):\n", | |
" return sum(1 for d in D if d.label == 1)\n", | |
" \n", | |
"def compute_accuracy(D):\n", | |
" return num_correct_predictions(D) / float(len(D))\n", | |
"\n", | |
"def compute_precision(D):\n", | |
" return num_true_positives(D) / float(num_predicted_positives(D))\n", | |
"\n", | |
"def compute_recall(D):\n", | |
" return num_true_positives(D) / float(num_gold_positives(D))\n", | |
"\n", | |
"def compute_f1(D):\n", | |
" p = compute_precision(D)\n", | |
" r = compute_recall(D)\n", | |
" return 2 * p * r / (p + r) if 0 < p + r else 0." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"10197" | |
] | |
}, | |
"execution_count": 50, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"num_correct_predictions(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"10662" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9563871693866066" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_accuracy(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9566441441441441" | |
] | |
}, | |
"execution_count": 53, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_precision(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 54, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9561057962858751" | |
] | |
}, | |
"execution_count": 54, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_recall(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.9563748944553898" | |
] | |
}, | |
"execution_count": 55, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_f1(D)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Cross validation (open evaluation)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"N = 10\n", | |
"for n in range(N):\n", | |
" train_set = [D[i] for i in range(len(D)) if i % N != n]\n", | |
" test_set = [D[i] for i in range(len(D)) if i % N == n]\n", | |
" w = training_with_perceptron(train_set)\n", | |
" predict_all_instances(w, test_set)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.7177827799662352" | |
] | |
}, | |
"execution_count": 57, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_accuracy(D)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 58, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(0.7309069212410502, 0.6893640967923467, 0.7095279467130032)" | |
] | |
}, | |
"execution_count": 58, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"compute_precision(D), compute_recall(D), compute_f1(D)" | |
] | |
}, | |
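{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can also inspect the accuracy of each held-out fold separately (an illustrative extension, not in the original notebook):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Illustrative extension: accuracy of each held-out fold.\n", | |
"[compute_accuracy([d for i, d in enumerate(D) if i % N == n]) for n in range(N)]" | |
] | |
}, | |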
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Notice that the huge gap exists between the performances of the closed and open evaluations. " | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |