Skip to content

Instantly share code, notes, and snippets.

@kylebgorman
Created July 19, 2017 20:43
Show Gist options
  • Save kylebgorman/1c9eeb9be2a0d7b13f2d4b2a17ec9e3a to your computer and use it in GitHub Desktop.
Save kylebgorman/1c9eeb9be2a0d7b13f2d4b2a17ec9e3a to your computer and use it in GitHub Desktop.
Lecture on simple linear models for NLP applications, given to the Portland PUG sometime in late 2014 or early 2015
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:c93276c0ddfebaad66c167cd4f918890a7ff5156a6ed33bd9b28d08ec81c9807"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Simple linear models for natural language processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" In NLP, we are constantly faced with the problem of making decisions in the face of competing cues of varying or unknown reliability. The traditional approach to this problem proceeds as follows:\n",
"\n",
"* _Probability density estimation_: for all possible decisions $y$, estimate $p(y~|~\\phi)$, the probability of this decision given cues $\\phi$\n",
"* _Thresholding_: pick the outcome $\\hat{y}$ which maximizes the conditional probability given observations $\\phi$:\n",
"\n",
"$$\\hat{y}~=~\\underset{y~\\in~Y}{\\operatorname{argmax}}~p(y~|~\\phi)~.$$\n",
"\n",
"This idea is essential to a wide variety of machine learning approaches, including _na\u00efve Bayes classifiers_ and _logistic regression_.\n",
"\n",
"In the 1980s and 1990s, alternative approaches were proposed under the banner of _statistical learning theory_. In this framework, the goal is not to estimate a probability density function (per se) but rather to learn a _classification function_ $f(\\phi)$ which predicts $y$. There are many learning algorithms which are guaranteed to learn a classification function which is _probably approximately correct_ (PAC). That is, if enough data is provided, the algorithm will (_probably_, i.e., with very high probability) predict $y$ with minimal error (i.e., it will be _approximately correct_). Learning this classification function is almost always easier than probability density estimation. One such class of learning algorithms is designed for _linear models_ (or _linear classifiers_), conceptually simple, **Unreasonably Effective** tools used extensively in natural language processing (NLP)."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Linear classifiers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the simplest types of classification functions is a _binary linear classifier_, which makes a binary classification decision based on the value of a linear combination (i.e., a weighted sum) of feature values, as follows.\n",
"\n",
"* A _label_ $y$ (i.e., the thing to be predicted) is a binary value in $\\{-1, +1\\}^n$\n",
"* A _feature vector_ $\\phi$ is a real-valued vector\n",
"* An _observation_ is a $(y, \\phi)$ tuple\n",
"* The classifier itself is defined a real-valued _weight vector_ $w$\n",
"\n",
"Given a classifier with weight vector $w$, the _predicted label_ for an observation is defined as \n",
"\n",
"$$\\hat{y} = \\begin{cases} +1 & \\text{if } w~\\cdot~\\phi > 0 \\\\ -1 & \\text{otherwise}\\end{cases}$$\n",
"\n",
"where $w~\\cdot~\\phi$ is the dot product (here, a weighted sum) of the weights and features. \n",
"\n",
"Without loss of generality, we assume that there is also a \"bias\" feature present in all feature vectors $\\phi$ and that there is a corresponding weight in $w$ (call it $w_b$), which acts much like an _intercept_ in linear regression.\n",
"\n",
"Geometrically speaking, the _decision boundary_ lies orthogonal to $w$."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%pylab inline\n",
"from sklearn import datasets\n",
"from numpy import arange, array, dot, linspace, meshgrid, \\\n",
" ones, sign, vectorize, vstack\n",
"\n",
"# the famous iris data set (Fisher 1936), but throwing\n",
"# out the third species, for sake of simplicity\n",
"iris = datasets.load_iris()\n",
"Y = iris.target\n",
"mask = iris.target != 2\n",
"Y = iris.target[mask]\n",
"Phi = iris.data[mask, :2]\n",
"\n",
"## a decision boundary, fit by hand\n",
"w = array([-3., 1.2, -1.1])\n",
"\n",
"## plot decision boundary\n",
"# create prediction mesh\n",
"h = .02\n",
"x_min = Phi[:, 0].min() - 1\n",
"x_max = Phi[:, 0].max() + 1\n",
"y_min = Phi[:, 1].min() - 1\n",
"y_max = Phi[:, 1].max() + 1\n",
"(xx, yy) = meshgrid(arange(x_min, x_max, h),\n",
" arange(y_min, y_max, h))\n",
"Phi_mesh = vstack((ones(xx.size),\n",
" xx.ravel(), yy.ravel()))\n",
"Z = sign(dot(w, Phi_mesh)).reshape(xx.shape)\n",
"# plot mesh\n",
"contourf(xx, yy, Z, alpha=0.1)\n",
"\n",
"## plot points\n",
"scatter(Phi[:, 0], Phi[:, 1], c=Y, cmap=cm.Paired)\n",
"xlabel(\"Sepal length\")\n",
"ylabel(\"Sepal width\")\n",
"xlim(xx.min(), xx.max())\n",
"ylim(yy.min(), yy.max())\n",
"\n",
"## plot $w$\n",
"# a phi on the decision boundary\n",
"index = tuple(vstack(nonzero(Z == 0.))[:, -2])\n",
"arrow(xx[index], yy[index], w[1], w[2],\n",
" head_width=.1, color=\"k\")\n",
"annotate(\"W\", xy=(xx[index] + w[1] + .1, yy[index] + w[2] + .1))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"<matplotlib.text.Annotation at 0x10e2cbc18>"
]
},
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAEKCAYAAAD6q1UVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8zPcfwPHXJ3sviXVGRBB7K7EVtandGq1RqmiV6qCl\nflqlLapVpUu1lNq0ZlE7dhAiIVZkILL3uPv8/rhIQ0lOksvg83w8PH53913vu1/zvu99xvsjpJQo\niqIozxaTog5AURRFKXwq+SuKojyDVPJXFEV5BqnkryiK8gxSyV9RFOUZpJK/oijKM8isqAMwhBBC\njUdVFEXJAymleNTrJSL5A4SGphZ1CPk2f/5spkz5qKjDKLbU5/N46rPJWWF9Pn6RYBqyGntbMxp4\ntjf69fLLSqN57LYSk/wVRVGKkm+8L5YhATTTWGPp2qqow8k3lfwVRVFy4Rvvi+W1AFpXrgtOZYs6\nnAKhkn8hatGiTVGHUKypz+fx1GeTM2N+PrejwTI4gNaask9N4gcQJaG2jxBCPg1t/oqilCw+sSk4\n3tiExs0Mj/LFv43/YVYaTcnv8FUURSlM/mG7cYyI1N/xu9Yt6nAKnEr+iqIo2Twwoqd+p6IOx2hU\n8lcURcn0tI3oyYlK/oqiPPNuR0NU8Gos4aka0ZMTlfwVRXmm+SSG4Rh8oMR26uaVSv6KojyznvZO\n3Zyo5K8oyjPnWenUzYlK/oqiPFOepU7dnKjkryjKM+FZ7NTNidGTvxDiBhAHaIF0KWWzR+zzNdAV\nSAJelVL6GjsuRVGeHT6xKTgGb6KWxoxSrs9Op25OCuPOXwLtpJRRj9oohOgGeEopqwkhngO+A5oX\nQlyKojxDnvVmnocVVrPPI2tLZOoFrACQUh4XQjgJIcpIKe8UTmiKojzN7o/osdQ82808DyusO/89\nQggtsExK+cND2zXArWzPQ4AKgEr+iqLkmRrRk7PCSP4tpZThQgg34G8hRICU8tBD+zz8y+A/pUbn\nz5+d9bhFizZ4e7ct+EgVRXkqPKsjeg4cPcpBHx+D9i3Uks5CiJlAgpRyfrbXlgL7pZRrMp8HAG2z\nN/uoks6Kohji/ogeUCN6IOeSzibGvLAQwkYIYZ/52BboDPg9tNtWYHjmPs2BGNXeryjKk/JJDCMq\neDUaNzNa1+/0zCf+3Bi72acMsEkIcf9aq6SUu4UQYwGklMuklNuFEN2EEEFAIjDCyDEpivKUeZbL\nNOSVWslLUZQS6/5KW/a2ZjTwVOP3H6ZW8lIU5anjG++L441nr1O3oKjkryhKiaLKNBQMlfwVRSkx\n7tfeV2Ua8k8lf0VRSgTVqVuwVPJXFKXY84uESsTioWbqFhijjvNXFEXJr/tlGpSCpe78FUUptp7V\nMg2FQSV/RVGKHTWix/hU8lcUpVhRI3oKh0r+iqIUG2pET+FRyV9RlCL3QJkGNaKnUKjkryhKkVJl\nGoqGSv6KohSJ+0M4Vadu0VDJX1GUQueTGIZjiOrULUoq+SuKUqhUp27xoJK/oiiF4n6nrsbNTJVp\nKAaMXt5BCGEqhPAVQvz5iG3thBCxmdt9hRAfGjseRVEK3/3E30xjjUd51cxTHBTGnf9bgD9g/5jt\nB6SUvQohDkVRioBfJDimHqSWxkyN5ilGjL2AewWgG/Aj8MilxHJ4XVGUEu52tH5Ej4ZY1bFbzBi7\n2WchMBXQPWa7BLyFEOeEENuFELWMHI+iKIXEJzGMqODV1NKYqaaeYshozT5CiB7AXSmlrxCi3WN2\nOwNUlFImCSG6ApuB6o/acf782VmPW7Rog7d32wKOWFGUgqJG9BSNA0ePctDHx6B9hZTSKEEIIeYA\nw4AMwApwADZIKYfncMx1oLGUMuqh12VoaKpR4lQUpeA8MKJH3e0XOSuNBinlI5vWjdbsI6WcJqWs\nKKWsAgwG9j2c+IUQZYQQIvNxM/RfRlGPOJ2iFHspKSlcvnyJe/fuFnUoRUJfpkGN6CkpCnOcvwQQ\nQowFkFIuA/oD44QQGUAS+i8JRSlxLl26wLDhfTC1sCAmMoI3xr/DWxPfK+qwCoUq01AyGa3ZpyCp\nZh+luGvbvgHPvzyatr0GEnPvLv8b8SJLvlnOc8893UMbfRLDcAxSZRqKqyJp9lGUZ4VWq+VaUCCt\nu/cDwMm1NLWfa82lSxeKODLj8g/bjWPQAVpryqrEXwKp5K8o+WRqaoqmYmXOHNoDQFJ8HAFnjuHh\nUa2IIzMOn9gU/M/px+63rt9JjeYpoVSzj6IUgFOnjjFy1ABKaypxJzSYvi8O5uOZn5M5nuGp4Rvv\ni+U1VXu/pMip2Uclf0UpINHRUQQGXsTV1Q1PT6+iDqfA+cb74noniAauNVWnbgmhkr+iKPlyf0SP\nmrRVsuSU/FVJZ0VRcpR94RWV+J8eKvkrivJYqkzD00slf0VR/kMtvPL0U0M9FSWbY8cO0cK7JpXd\n7ejesw3BwdeLOqRCp8o0PBtU8leUTHfuhDP6tcEMmjyDHw9cpFbrjgx75UV0usdVJH+6+EWC/7nV\nuN4JonXlumoo51NOJX9FyXTu3Cmq1q5Pw9bPY2FlTffhY7l3L4KIiDtFHZrR+SSGYRqir73fwLO9\nGsr5DFBt/oqSydnZlfDg66SlpmBhaUXUnXBSkpNwcHAs6tCM5nY0RCWrTt1nkUr+ipKpSZPmNGnU\njNkj+1GtXiPOHNzDu+9+jLW1TVGHZhQ+sSk4BqtO3WeVmuSllEg6nY5FX8/lj7W/YW5uzutj3mLI\nkFEFct5du/4kJOQm9eo1emqrcqoyDc8GNclLeeos+34Rm7dvZvy8paSmJLPgw4k4ObnQvfuL+Tqv\niYkJXbv2LqAoi5/7M3Vdbc1ooGrvP9NUh69SIu3YuZWBE96nUvWaVKvXiJ4jJrBj159FHVax5hOb\nojp1lSwq+Sslkq2tHZF3wrKeR94Ow87WrggjKgHMotSiK0oWozf7CCFMgVNAiJSy5yO2fw10Rb+M\n46tSSl9jx6SUfJMnTePVEf24ffMaqSnJnPz7L7Zs/qeowyqWso/oKaVRd/uKXmHc+b8F+JO5hm92\nQohugKeUshowBviuEOJRngJNm7Zg/brduDs4ULu8hu3bDlOlimdRh/VIUkqWfb+I7r3aMGBQF44c\n2V9o1/aJTSEqWC28ovyXUe/8hRAVgG7Ap8DkR+zSC1gBIKU8LoRwEkKUkVI+/bNqlHyrWbMONWvW\nKeowcrVkyXzWbFrNS5M+IjYqgjGvD+H3lVupX7+xUa+rL9OgRvQoj2bsZp+FwFTA4THbNcCtbM9D\ngAqASv7KU2Pt+lWM+GgeVWs3AOB28HW2bFlntOSvRvQohjBa8hdC9ADuSil9hRDtctr1oeePnHgw\nf/7srMctWrTB27ttvmNUlMJgZm5OSmJi1vOUxETMHR93P5Q/2Wvvq47dZ8+Bo0c56ONj0L5Gm+Ql\nhJgDDAMyACv0d/8bpJTDs+2zFNgvpVyT+TwAaPtws4+a5KWUZBs2/M7sOR/Sc+R4Yu9F8M+Glfy5\n9QDu7lUL7Br3O3VRZRqUbHKa5GW0Dl8p5TQpZUUpZRVgMLAve+LPtBUYDiCEaA7EqPZ+xdiOHj1A\n2/YNaNbCi/Hjhxu9ame/fi8z//Nvib8agF1aMls2/1OgiV916ip5USjlHYQQbYEpUspeQoixAFLK\nZZnbFgNdgERghJTyzCOOV3f+SoHw8/Old5/29HhlHJoqnqxfuoDK5Suwft2uog4tT1SZBiUnagF3\nRcn02muDiMWE8Z8sAuBOyE3eG9CR69diiziyJ3O/U9fe1owGrjVVp67ySKq2j6Jk0ukkpuamWc9N\nTEwfPcKgGFOdukpBUMlfeaaMHz+F/gNeQONRDU2Vaqz99nPq12tU1GEZRNXeVwqSavZRirVx44Zy\n1OcQ5ubmfLXwe1q16pDvc+7du4MPZ7xDSmoKDeo14ofvV2Nmlr/7ICklW7eu58jRA7i5uvHaa2/i\n5OSc71iz84uESqnr1Lq6isFUm79SIvXo0ZprN6/Ra+QEwm9c5eBf61m7ZjtNm3oXdWj/seibeaxe\n+xsd+g8jOPAiwf7n2P7XYezs7Avk/D6xKTjeyFx4RSV/xUAq+SslkoenMzN/2oC7l76EwzcfjCf6\n5jX27j1dxJE9SEpJ9RquzF27B9dyGgC+fPMVXh04jH79Xs73+dWIHiWvVIevUiJJqcPO8d+mEwfn\nUty+7F+EET2alJL0tDTsHJ2yXrN3ciYlJTlf51VlGhRjUvX8lWLLyakUi6dP5Jr/OY7u2sI/m9cw\natSEog7rP0xMTOjWoy9LZ7zN9Ut+7N+8hvNHD9CuXd7XxfVJDFMLryhGpZp9lGIrNjaWdh0akJCY\ngBCC/i8OZs6crw069u7d22zcuJq0tDS6du1FtWo1s7YlJiawbt1vREdH0apVB5o2bZHvWJOTk/lk\nzjSOHDmAm1tpZn40lzp1GjzxeVSZBqUgqWYfpUQ6dGgv6RkZdOw/jLu3bnD02CESEuJz7UQNCwuh\nR8821HquFdZ29ix78XlW/LKRJk2ak5SUSO8+7XEoq6FMZQ9+Hj2Q2bO+pE+fQfmK1dramk9nL8zX\nOXxiU3AMzuzUrZ/3Xw2KYohc7/yFEFZAP8Cdf78spJTyf8YN7YEY1J3/M6i5d01GzPgCr4bNAFg0\ndSy9OnTm1VfH5XjcrFnvEpacwstvfwjAwb/Wc/7vP1m3ZgerVv3E2m2beHvBTwghCLrgy5L3x3P6\nZJDR309OVKeuYgz5vfPfAsQAp4GUggxMUXISHxdLmQqVsp67aSoRFxeX63ExcTG4Vame9bxMhcrE\nx+nLN8TFxeCmqYQQInObO/FxMQUcueFUp65SVAxJ/hop5QtGj0RRHvL8811Z+eUsXp48gzu3bnBk\n2wbe+G1Lrsd17tidjz6eimedhtjYO7Bu8Ty6dOwGQOvWHVm8ZAENW3ekfJWqrFk0h+cztxU2VaZB\nKUqGNPt8DyyWUp4vnJAeGYNq9ikk0dFR7NmzHSklHTt2xcXF1ejX9PPz5ezZU5QvX4EOHbpk3ZUn\nJSXywfS32LdvFw4Ojnw47VO6du1t0DlXrFjG4iVfkp6eTt8XBzPtg0+yZvH+/fc2Zs1+n9iYaNq0\n6ci8ud8U2GQsQ6hOXaWw5GmSlxDCL/OhKVANuA7cz8BSSlmvoAN9HJX8C0do6C36vNiBijVqI0xM\nuH7hLFs276NiRXejXfP333/ms3kzadCqA9cunqNRvUZ8veinrC+Ap42aqasUprwmf/fMh5JHLLUo\npbxZUAHmRiX/wjHlnXGk2tgyYPy7AGz8/ivkvTt8vegno1wvLS2N2nXKMXvVdspVqkJaagofvtyV\nRV8upXnz1ka5ZlHzD9tNM4sk1amrFIo8reQlpbwhpbwBfHL/cfbXjBOqUpTuRtyhste/TRDuXnW4\nG2G8hdUSEuIwMTGhXKUqAFhYWlHBoxoREXeNds2i5BvvCxGRWJp5FnUoimLQDN862Z8IIcyAxoac\nXAhhJYQ4LoQ4K4TwF0J89oh92gkhYoUQvpn/PjQsdKWgtfRuw67ffyA+JpqEuBh2rPwe7xZtjHY9\nZ+dSlC2nYfvKH9BptQScOU7AmRM0aGDQf14lhl8k+J9bjeudIFqrET1KMfHY0T5CiGnAB4C1ECI+\n26Z04HtDTi6lTBFCtJdSJmV+aRwWQrSSUh5+aNcDUspeTxq8UrBeGz2RkNBgJnZpCsCgwa/yxrjJ\nRrueEIIVyzcwZFhvVi2cja2tPd9999sDfQyBgf5s3boON7cyDB06+oHSywkJ8Zw8eRQzM3OaNWuJ\npaVlgcR18eJ5QkNv4uVVh0qZv0rySo3oUYorQ0b7zJVSvp/vCwlhAxwAXpFS+md7vR369X175nCs\navMvRFqtFgBTU9Nc9sy/OXOm8+PPS/CoVY+Qq4F4Vq3BX1sPALBu3Uree38C7l51uHc7DEszM44c\nuoCFhQWhobfoN6ATjq5lSE1JxsrMjHV/7MTe3iFf8cydN5PVf6zAvXotgi6cZe5nX9OzZ/8nPo8a\n0aMUB3nt8L2/vJGA/65096iF1h9zHhPgDFAV+E5K+e5D29sCG4EQIBR4J/uXQ+Y+Kvk/hVJSUqhR\n042Pf96IR636JMbHMqVPW2Z++BmDB79CzdrlGP7uLLy79CEjPY2Zr/ahZePnmDfvW15/YxgWZTT0\ne30yUkqWzZxMvSoefPD+7DzHc+HCOYYM782nq3dh7+TMzcv+fPraAM6fC3miXxVqRI9SXOR1hu8C\n9EnfGn0b//1x/vWAU4BB1bCklDqggRDCEdglhGgnpdyfbZczQMXMpqGuwGag+sPnmT//3z/qFi3a\n4O3d1pDLK8XYrVs3MDExwaNWfQBs7R1x96pDQMAFAJKTE6ndTD8qxszcgtrNWnHjagAAwbdu0KvH\nAEDffFSrqTfBvifyFU9IyA3ca9TGPnMFrsrVa2FuYUl0dCRly5Y36By+8b443lBlGpSiceDoUQ76\n+Bi072OTv5SyHYAQYiPwmpTSL/N5HWDWkwYlpYwVQmwDmgD7s70en+3xDiHEEiGEi5QyKvvxU6Z8\n9KSXVIq5KlU8MTEx5ejOzXh36UPItctcPneKqW/o+xmcnFzYveYX+o+bQlzUPY7u3MzIYa8B0LB+\nY/Zv+p3q9ZuQnp7G4T/X07d7n3zFU7NmXa74+RJ85RKVqtXk+J5tWJhb4OZWJtdjVZkGpTho6+1N\nW+9/V7r7dMGCx+5rSJu/v5SyVm6vPeZYVyBDShkjhLAGdgGzpJR7s+1TBrgrpZRCiGbAWiml+0Pn\nUc0+hUSr1XL58iUAqlev+UTt/qdPH+Pq1Su0afO8wXfKmzf/weR3XsfE1JT01FQGD3qFefMWA+Dr\ne5LBL3cnLS2NjPR0WrRow9o/dgD6sswjRw3k9Jnj6LRaevTox8IF3xsc77Fjh7h16ybt23fG1bX0\nA/G8+954LKyssDCzYPnP66hfP+fRRz6JYTgGGadTV0pJ0LVrJCQlUbNaNaysrAr0/MrTLb+F3c4L\nIX4EVqJv/38ZOGfgtcsBKzLb/U2A36SUe4UQYwGklMuA/sA4IUQGkAQMNvDcSgFLTExg6PA+3AoN\nRiAoW6Ycv6/calAn6oBBXTh9+jjObmV4/4OJzPl0EYMHv5LrcRERd9BqM7B1cCQ9LY07d29nbbO3\nd8DBwRFzKxtio+7h7l4VKSVCCFJTU4iOicLFtTTpaWlERNwhPT091+Sv0+no0s2bq9eu4Ohcivc/\nmMi3i3+lSxf9eIM+fQbRpUtvoqMjcXMrk+PC7vc7dR2N1Kmr1WoZOeEN9u7/BwdrCzC3Zvv6jbhX\nrFig11GeTYbc+VsD44D7Uy4Pou+4LbQKn+rOv3DM/uQDLt68zthZC0EIfpz9Lh6ubsz+3/wcj/v1\n12V89vksPl+/FwfnUpzct4OlM94m6EpUjselpaVRw8uNKQt/om7zNkTdDefdAR2Z//l39OrVn569\n21Hv+W50HvQqyYkJzBkzkHcnTaNHj75Mmvwa8cKUYVNnodNq+frd12nbtDlvT5qW4zXnzZvBmo1r\nmPP7Dmzs7Nm7YRXrl3yB/4WwJ/qsCqNT95c1a/hm/ifMbOmGpZkJGy5FE2pThW3rNhjlesrTJ08z\nfO+TUiZLKRdIKV/M/LewMBO/UngCrwTQ5PlumJiaYmJiQtMOXbl8JSDX486cOUntZi1xcC4FQJP2\nXUhJSc61/PLNm9cQQlC3uX4imUvpclSr24jTp48BEBQUQPPO+jtya1s76jRvw5XMeC5fCaBZx+4I\nITA1M6Nxhy5ZzVU58fM7S+O2nbDJLOTWonNPEuJjcz0uO32n7iaaaayNOprnUmAAjdzMsDTT/5l6\nV7AlIOiK0a6nPFsem/yFEOsy//eCEMLvoX9FVuFTMZ4a1bw4tXc7Oq0WnU7Hyb3bqV7NK9fjGjd+\njosnjhAbdQ+Ak/t2YmVljYNDzs1FVap4IqXE79hBACLvhHPF7wxNm+o7rDw9vTi2+08AkhMT8PM5\nQLXMeKpX8+LEnm1IKclIT+f0vp1Ur17z0RfKpm7dBpw+8DdJ8fovJp/df2Jn4NyA8CjJqeiT+pm6\n9TsZfTRPzRpenInIIDVDB8CRkES8PKsZ9ZrKsyOncf7lpZRh2Qq8PSCzxk+hUM0+heN+m39wyE1M\nhMkTtfkPHNSVU6d9cHYtQ/S9u8z55CuD2vx//PEbPvl0Ok6l3IiJukf7dp1Z/vM6AIKCAnlpSI+s\nNv+ePfrx2ZxFCCGIiopk8MvdiY2PIy01lZo1avHzT+ty7RDV6XR07d6SoKuXcXAuRVzUPb779lc6\nd+6Ra6wzln7GT7M/5qfP/seQ4aNy3T+/dDodI8ePY8/+f3CwMgdLG7avK5g2/4yMDMLv3MHVxQVr\na+sCiFYpjvI0yStrByFGoy+/UGS/N1XyLzxarZYrVwKQUj7xaJ8zZ45z9eoVWrfuYPBonwsXzjLs\nlRdJTUsjPTWFOZ8uYsCAoYB+Ld4hw3oTHh5CWmoq48e/w5TJ/5Z+Sk9P58qVAMzMzPD0rIGJiSGl\nqvSOHTtEaOgt2rfvbNCaBT6JYZQPPcSUt+fic1Y/B/FuQAAO9sZdB0BKydXr10lISsLL07NARvuc\nvXCBfsOGkJqSTEq6lkWffcaQAQMLIFqluMlv8v8f0Aqogn5y10HgkJTybEEHmkMMKvk/hXQ6Hc+1\n8KLvG+/g3aUPodeuMGfsIDZv3IunZw0GvdSNMl716Dv2beKi7vHJawOYO3sBHTp0KbQYH1Wm4fS5\nc7Tspl/9a/7//sf4Ucb/FVBQdDodNZo0YqCHOW3cHbgVm8qMQ3fZ/9d2qnuqaqNPm/x2+M6QUnYA\nagGHgXfRr+erKPkSHR1JfHws3l30k7M0HtWo0aAJly7pZ/ieP3+GjgOHI4TAsZQbjdt34fx530KL\nzyc2hajg1WiIpXX9TllDORvXr09ySAg9OndmyowZWGk03IvKeWRTcREZHU1sfDxt3PVNeRUdLalV\nxg6/S7l3litPl1yTvxDiIyHEDmA34AlMAdRAYyXfHB2dQUqu+eunjSTExXD90gUqZC7aXqFCZS6e\n0BeAzUhP48rZE1nbjC23ET1CCNYvX87Z/fv1sdatyyfzcx4SWxw4OzqCMOFKZDIACWlarkYmUalC\nhSKOTClshjT7+KIv47wNfZPPUSllobbBqGafvElPT8fMzOyJl0TMqaqnTqcjJSUFGxubAolxx44t\nvDN1HB616nHraiD9+77Eh9PnAHDu3GmGDe9DxWpe3LsdRm2v2vywbLVRq43eL9Ngb2tGA9eaBpVp\nkFLyxtSpLF+9GoCgkyepUN6wPo/7CrOS6pYdO3h90ptUc7PjRlQSQwa9xGczP35gn7z+t1MUSlKs\nhS2/zT4NgY7ACaATcEEI8XA9fqUYCQ8PpVefdlT1dKJ2nXJs3vyHQcdptVqmf/Q2np7OeHo68977\nE8nIyMja/uqIfrhXcaB6jVI0aeZJeHhovmP18qqNq1sZzvocID015YHFY+rXb8w/+84wedxkFi/8\nkR+/X2PU5OiTGIZpyGpqacxo4Nne4Po8Qgi++/JLAo7p5yd4Nm3K2x8atiaRVqtl0gfv41zVA+eq\nHoyfOuWBz9wYenftyrG9//D+zM/5c+2GBxJ/ZFQU3fr3xblqVdyqefL9ihVGjSU/QsPDadutC85V\nPSjnVYM/Nm8u6pBKFEPu/Ouin93bBn1RthDgoJRyhvHDy4pB3fk/gd4vtqdivSb0GzuZ4KAAvpgw\njD9Wb6d27Xo5Hvfd0oVs/Gsjkxb8iDAxYdGUMXTr2IW3Jr7HF1/MYsWqn/h4+SacXEuz7ON3uH3l\nEocO5n3Kh5SStu0b4N1rEC+8NJLL507x9Ttj2LH9SL4XUXkSt6MhKlh/114QZRo+nDOHL7/9FoAL\nhw7h6eHx2H0XLvmWVcu/4/3mrggBnx+/x4svjeL9SW/nK4a86j9sCDLsIiPru3AnIZ2PD99l5U+/\n0KaFQUV8C1W77l2prL3NoFrO3IxNZfaRu+zcsJl6tWsXdWjFRr7u/IHPAHvga6CmlLJdYSZ+5clo\ntVp8Tx/nxdcmYWJqinuN2jRq04lTp3Iv83rU5xCdXx6NnaMztvaOdBn6Gj4+hwDYf+BvOg96Fbfy\nFTG3sKT/65MJv52/O//o6Eju3Amn65DRmJiY4NWwGV6NmnHunEFLRRSIrE5dN7MHOnXz45Np07h5\nVj8Yrk7r1gwbN47H3WQdOHSAHh422FuaYmdhSk8PWw4eOpDvGPLq8MmTDPRyxMxEoHGwoHUFa44c\nP15k8TyOVqvl5PkLDKzljKmJwMPZimYaO46dOlXUoZUYhjT79JBSzpNSHpVSphdGUEremZqa4ujk\nwo1A/YgZbUYGwVf8KV069yaM0qXLcOPSv3fyNwIuULq0vpyxm2tprpw/k5XErl/yw9Iyf2PO7e0d\n0Wm1hN+8BkBaSjIhVy9nXdPYjFmmoYybGymhoXw6fTrrtm7FukIFzl+8+N/9ypTjasy/f1bXYtIo\nW7ZcgcbyJEq7uBAUra/eopOSm/E6ypQunctRhc/U1BQnezuuZcaq1UluxKYXy1iLq1ybfYoD1ezz\nZLZt28S770+gYasOhFwNpFL5Ciz/aV2u7eVhYSH07tOeitVrIUxMuOF/ni2b91GhQmXu3btL67b1\nKF3RnVJlynP28D5m/28BQ4aMzFesq1cvZ87cGTRo2Z5r/udpXL8xi7760aiddw906noaf6WtmNhY\nytbSV0Bv36oV21avzpqQFhIWRrse3ahsC0LAtTgd//y1ncpFNPrmn8OHeXn0SBqXs+VOUgZ2pSuy\nc/2mAlsfuSBt2raNN6ZMokl5e4Lj0vCoUZf1v64slE7zkiJfk7yKA5X8n1xgoD+nTvng6upGx47d\nDfqDkFLy+Rcf88svy5BSMnToaKZP+yQrEUdHR/HVV3OIi4tl8OBXeO65f2vbHD16gA+mT+JexB2a\nNWvJ/C+X4uKiL/S2fPl3zJk3g/TUVCwsrZjzyUL69x+Sdayfny/nz5+hXDkN7du/YNTEf78aZ1Es\nqP7z779wiNttAAAgAElEQVTzxtSpAOzbvBnvpk0BiIqOZsfevUgp6fL887i6uBRqXA+7ev06B48d\nw8nBge6dOmFhYZHvc378+ecsXvot6RlaNGXLsnfrX5Qrm/uv0Ru3bjF6/DguXL6MR8WKfP/1YurU\n/LeGk39gID6nTlHa1ZVuHTuqxP8QlfwVg6xc+SPf/fQt4+csRpiYsGT6BEYMGc3IEeNyPO7mzWt0\n696K0TO/xKNWPbb89DUJYbdYu2YHV69epmPnZgyd/BFN23fh0LaNbFi2gBPHLuPq6lZI7+xfPolh\ntEg+UuiJ/76k5GTK165NSmoqdWvV4tjOnU99wlq3ZQuvvTmB6W00aOwt+Mn3LqEZNgSczLlvJz09\nnYZtWtGyVDodKttzKjyRdUEpnDt8FMdcigYqennq8BVC/JnDv63GC1cpKrv37KDXqIloPKpR3r0q\nfV6bxJ69O3I97tixQ9Rr0ZZGbTri5FqaoVM+5vixQ6SlpbF69XJcy5anY/9hOJZyo8fwsdjaO7B5\n85pCeEf/uh0N/mG7cQw6QClyr+VjLDbW1sRcu8aqpUvx8/fHtlIldv/zT5HFUxhWbdhAhyqO1Ctj\nSykbc15vUpaQ23dzPe5GcDCJ8bH09XLGydqMjh6OuFqbcu4RfSfKk8tpJa98TVcUQlgBBwBLwALY\nIqX84BH7fQ10Rb+K16tSysKbv688wMnRibshN7Oe3wm5iaOjU67HOTg4ERF2C51Oh4mJCRFht7Cw\nsMTc3ByNphKxUZGkJidjaW1NUnwcifGxVK78+OGPBc0nNgXH4MyFV+p3KrTr5qRfz570fOEFarVs\nSa+hQylbujSBx44Vy7b1/HJxcuLixbSsVdjC49MwN819oKGDgwPxKWkkpGmxszAlNUNHZGIqTuqu\nv0AYtdlHCGEjpUwSQpihrwv0jpTycLbt3YAJUspuQojngEVSyuaPOI9q9ikEQUGB9O3XkUbtXsDE\n1ISTe7azft1uatTIebnm9PR0Bg7uSpowobJXHXx2buHNCVMZ8erraLVa6jesjI2jM43bduL4nm3I\n9HR8T18rlPfkG++L5bUAmmmsjV5/P6/+PnCAni+/DMBvS5YwoHfvIo6oYN29d486zZtRzcmMyk6W\n7AqKYfiQ4Sz89NNcj33v4xls3byeJqXNuRClpW7jFqxY+r2azWugfI3zF0JUF0KsF0JcEkJcz/xn\n0F+ulDIp86EFYAo8XP2qF7Aic9/jgFPmgu5KLiIjIxg3fjht2zfk1ZH9Ccl2x56cnMz0DyfRrkMj\n+g98AT8/w35MeXrWYMf2IzznVZMmntXZvu1wrokfwNzcnKEvj+JG4EX2rF+JtZUVPXv0BfRD8v7a\nepCU2Gi2//Y9Mi2VHdv+nSAeFxfL21PG0LZ9Q14a0pOgoNxXDjOEXyT4n1v9yIVXfE6epEPPbjRs\n1YJ3Z35EaqphNxYBV65Qs2ljSld1x7NBXY6ePJm1LTk5mbfef5cGLVvQqU9PfP38DI41PDycetU9\nsbK0YNgbb2Dv7k5iUlKux52/eJEX+vamQcsWTJg6xaBjjGnTtm14d+pAk7at+Oq7JVnDgku7urJi\n6ffcSjVn940kmjdrwdwZhk0VmjtzFp/PW4hXl1d4b/psfvlumUr8BcSQGb5HgJnAAqAnMAIwlVJ+\nlOvJ9Qu3nwGqol/3992Htv8JfCalPJr5fA/wnpTy9EP7qTv/bLRaLd17tqZinYa06tGfs4f2cmzb\nBvbuOYWNjS3jxg/nTnw8vUdN5EbgRdZ9M5ddO33QaIxTFM3Pz5defdozbMpMqtauz6YfFhEWFMCJ\nY4EkJyfTqXNTGnfqSaN2nfHZuZmgUz7s3H4UMzMzBr/cHVOnUnQePIJA3xPsWLGUvXtOGlRj/3F8\nEsNwDDrwyBE9V65epU33royo44DGwZLVl+Ko792Jb+cvyPGcaWlpVKxdk1YaazpWdeRkaAKbA2MI\nOHma0q6uDBszmjD/E/Svbs+1mFRWXYrn2J59VNJocjzvuq1beW/aVMY1dEYAi05GEhGvT+KL581j\n9NChjzwu7PZtmnZox+Aatng6W7HpSjzOVRvyxy9FU45h78GDvDp2NOMaOmNjbsKP52MZ8/qbvDn2\n9Tx/5kr+5XeGr7WUcg/6L4qbUsqPge6GXFhKqZNSNgAqAG2EEO0esdvDgRX/4UdF7NatG9y5c5sh\nk2fgXqM2fUa/iZWDI35+vmi1WnZs28TY/y3E3asO7XoPop53W/bv/9to8fz66w/Ubd6a5/sNwd2r\nDuPnfEN46C2SkpK4dMkPEwtL+o59G/catRn85jRi4mK4du0KsbExnDrpw6jpc3GvUZsXBo+gUvWa\nHD9+JE9x3I7W3+07Bh2gtabsI0f0bN+zh5YaG9q6O+LpYsWERi6s+zP38QtHTpxAp03ntcal8XC2\nYlAdV0pZm7Lhzz/RarVs3rWbN5uUwsPFio4ejjQsa8PfmRU/c/LHujW85OVAvTK21C1jy8j6TnRq\n2ZyObdsy4b33sNJoiI6J+c9x+w4dok5pazpXdcLDxYqJTUqx/Z9/SE8vmnmY6zZtoE81WxqXt6Om\nmw0j6jqyZv1aIO+fuWJcOXX43pcihDAFgoQQE4AwwPZJLiKljBVCbENfG2h/tk2hPFgeukLma/8x\nf/7srMctWrTB27vtk4TwVLG0tCIlOYnUlGSsrG3QZmSQGBeLlZU1JiYmmFtYkBATjZW1vvJmfEw0\nVlbGW6rPxsaGuKjIrA69xLhYhBBYWFhgaWlFYkIcGenpmJmbk56aQkpSElZWVlhYWKDVaklOjMfO\n0RkpJfGx0XlareqBTt0cZutaWVoSn/7v/UVcqhYrA8axO9jbk67VkaqVWJkJtDpJUpoWezs7/Wdu\nZkZ8qharzMXW41N1WBvwPqysrImL/beQW1yqFhsnW/745Vf8/P1p2qkT5WrX5pNp03hn/PgH30eq\nNuszT0jTYWpiUmTDRq2trYlOy/65ZmCd+d9cXj9z5ckdOHqUgz65l3IBw5p9mgGXACdgNuAAfC6l\nPJbLca5AhpQyRghhDewCZkkp92bbJ3uHb3PgK9Xha5i3Jo3C/+oVmnbswQWf/diamLBq5VZMTEz4\n5pvPWfXHr7TvP5Tgy/7c8j/Hjm1HsLW1M0osUVH3aNGyFvW821GtXmN2rPoRT3cP1q3dhU6n45UR\n/YhOTqZeyw6c3reDKpoKfPftrwgh+HjWu+w7tI9WPQdw5exJEiPusHnj3ieaWOQfpl9py5BO3ajo\naFp07kgt+ww0tiZsv57MO5On8sbI3FfjqtviOWTCPTpUceR4SDy308y5evY8ZmZmzFu0iJ9/Xkqn\nSlbcjNdxM92Go7v3YGeb833SmfPn6T6wP13crTERgu3Xk9i8ajXPNW4M6CfejXzzTVZv3AjADV9f\nypYuTVJyMq26dKYcsXg4mrInOIUhQ0fy4TtTDfzUCtaVq1dp26M7HSpaYmsm+PNaIiuW/kCndu3y\n9Zkr+VMgk7yEEA4AUso4A/evi74z1yTz329Syi+EEGMzz7Msc7/FQBcgERghpfzPzA+V/P9Lq9Wy\natVPXLh4Ho8qVRkx4o2sYYIZGRlMnPgqp8+cwM7enq8W/EC9eo2MGk9o6C0mTRpFVHQULZq34n//\nW5BVwiAtLY0VK5ZyJegyNb1qM3z4mKw7VCkla9f+yhnfU2jKaxg9eiI2Nob9sMxrmYaIyEgW//AD\nkZH3eKFjJ3q+8IJBx0XHxNCmR3ci7t7GwdGZXRs2UqWSvh8lIyODERPGc+LkMRwcnfj+q29oWC/n\nKqpZ78PfnxWrf0dKyfDBL1G/Tp2sbWfOn2fu/C+4ey+CY2f1nchvv/46n330EX6XLjFi/Dhio6Np\n3LgJvy75LutLMzomho8+/YTLlwOoXbsO/5v2IfZ2uX/563Q6vv3xB7bt2IaTszMfvfsBtb28DHof\nqzdu5JPP55KRkcGQQS8xY+q/X0R5/cyV/MnvGr5NgZ/R3/EDxACjpJSFVj5PJf8n88G0Nzlz0Y9u\nr7zOzYAL7Fv/G3t2n8DV9ekpepVTp64xSCnp2r8v3LtGW40Vp++mciPDgSO7/sbS0pKJ707lxIEd\n9K5qy7XYNP6+lc7Jffsp7Zr3juuAK1do37M7g7zscbEyZXVAAuXcq3H0pP5Pr7SLEx0rWODpbMFf\n15Ko16IDSxcuIj09ndZdX6C8jKJpGUsOh6WQbF+BPVv+zHWR+1nz5rJx7W8MrG5HeEI6G4OSOLJr\nd9aX3OMcP32aPkNeYmgte2zNTfn1YhyzPprFsEGD8vz+lfzLb4fvz8AbUsrKUsrKwPjM15RiSKfT\nsWbNCt78YikNWran96iJVG/QlD17thd1aAXigU7dynULrUzDzZAQ/C5e4M3GpWhU3o7R9V1IiYvi\nzPnz6HQ6fl27lqnNXGlc3o4BNV3wcjZj+549+brm6o0b6FDRmq6eTjxXwZ6JjZy5HR7K1cyyxXej\nYrgTn0qjcrZMfa4UqzZuIiMjg/P+/sRG3mFMAxcalbdjfKNSXLsWRNC13Edo/7zyNyY1caGJxo6e\nNZxpUd6STdu25Xrcr2t+p3dVGzp6ONGioj1j6jvx/fIf8/X+FeMyJPlnSCkP3X+SOUnLuEsNKfki\nhECn02U912oznoqx0bej4bpF2L+19w1caasgCCGQ8sGhaLrMztb723XZfkVrJfn+zIUwQZftgtrM\n62nKleOXxYup6GTD3uux9FkTSHBs6gOxaHXyP7Hmdtf/yPehM+x9CAS6bM+12T4bpXgyZLTPASHE\nMmB15vNBma81AnhUG71SdExMTBg+fAwL3x7FC0NGExx4kZv+fnRaWPLvwq6bpFAr9ggeFoVfm6eS\nRkPTRo2Yf+ISrcpb4ns3DWe3cjSuXx8TExNeGzaMuds30sPDhuuxGVyPl/TolL9SEsMGDKD18p9w\nsIzC1dqMtZcTeHuSvh29S4cOzLCyp0cNa/4KjOS9v4OprNFgampKvVq1KKupxOJT4TQuY8HR8FRq\n16pD1Sq5r442btRrfLl8Kf2q2XE7MYNTd9P4pmfPXI8bNfwVug3YiqWpwMbclDWX4vl8zn+quSjF\niCFt/vvJYey9lNLov7tVm/+T0el0/PzzEg4d3U9ptzK8/dYHlC9fNPXhC0pxKNOQkpLC3K8Wcu78\nWapVq86H77yLg709oO/wbd+7F1evBGJhZc3yJUtp3yr/cfoHBvLF1wtJiI/nxd59eblfv6xtR0+c\n4OXRo0hNScLOwZlb4eEAHN62jRqensyZ/yUBgZeoU6ce096ejLV17sN9pZQs//13tu/chpOTCx9M\nnvLAl8aBo0dZueZ3LCwsGDti1ANLJp46e5ZFSxaTlpbK0JeGGtypGxcfz/xvF3PjxnWaNm7KuJEj\nn/pKp4VFlXRWSqzCXnglrzr26cX1AD/61nQhKCqFA8EJnDt8lIq5zPDNq/Dbt6nj/RwtK9hSo5Q1\nmwOiKFO5Gqf89BUvmzVsyP6tWw1q6jHUrn37GPnG6/StbkuqVvLn1SR2rt/4wOikJ5WamkrbHl1x\nSYukdikzDoSk0sC7A99/9XWBxf0sy29tn7JCiJ+EEDszn9cSQqgBuorR+SSGYRqymlqa4p34dTod\nPqdOM7tDRTpWdeL1pmWp7mLJZ199ZbRrfvHtt1RxtGB8s3J0rOrEp89X4syFi8Rdv87yb77hhK8v\nNhUr8s/hw7mfzEDzv17IqHqO9KjuQr+apehV1Zbvfspfc+Lh48dJiY7gzSal6OjhxLQWbqzbupWY\n2NgCilp5HENuC34BdgPlM59fAd42VkCKUlQjevJKp9MhAWvzf/+crM1NSE1LM9o1U9PSsmYTA1mP\nMzIyeKlvX2KuXsXF2ZmugwZRy9u7QMo+pKenP3RNQVpa/n6Rp6enY2VuktU5bGEqMDMxIT1DjSkx\nNkOSv6uU8g9AC5C5iLv6f0YxCp/YFKKCVxfJiJ68MjMzw7NyRT47FMrFu0n8GRjFmfBEJowebbRr\njhsxAr+7SWwJiMQ/IonPDoXirimHjY2+pIeVlRVhFy6w8ZdfuHbzJvbu7mzduTNf13xlyHCW+8Vy\nJjwBn1vxrL+cwJBBL+XrnN7NmhGZZsJa/yj8I5JYfDqSZo0aFvlSls8CQzt8+wF7pJQNM8swzJNS\nFlpxHdXm/2x4kjINoJ8Zu27rFizMzRk6cBDuFSvmekx+SClZvXEj5y9ewNOjKq8OHoyZmX7AXEJC\nAj0GDeTylUCsra1Z9MUCenTunHXsL6tXs/z3VdhY2zDv44+pVyv3Utmgr4W//PffiU+Ip3unzrTI\nXPcX9G3w4ye/RVJSEh4eVdm+dj0Oj1joJCMjg8bPP09gUBAO9vbc9PU1qPP3UZb/vooVK3/FzMyM\ntye8Rfds7zGvgkNDeW/GdG4GB9OkUWPmzPg417IYimHyO8O3MfANUBu4CLgB/aWU5wo60BxiUMn/\nKZaXTt2jJ0/Sb9gQOlayJlULR2+nceCv7QYNZ8yrie9O5cDe7TxX2pzzURlU8mrI2l9+zXU8+ycL\nFjB/0UJ61XAmKjmDQ8HxHNi2I9eO0rv37uHduSO1HSUulrD7RjLfffU1vbp0yVP8B3186Ny/PwA/\nLFzIsIED83QepeTI92gfIYQ5UCPzaWBm00+hUcn/6XV/COeTlmnoPqAvdbTBdPBwBGD1hUjs6nVk\n8edfGiXO8Dt3qNfSm++7V8TG3JR0reStPeGs/30tDevWzfHYctWrMqGxK000+to63528TbxLNfZu\n2pzjcXO/+gqfLT/zRmP9QvdnwxP545YpZw7mreQ16PsnOvfvz+HjxwG4GxCQNVxVefrkdQH3ZkKI\ncpDVzt8YmAPMF0KoBjklX+536lpeC8hTp25CQgKlbP6do1jK2pSE+PiCDjNLUlISdlbmWGd2eJqb\nCpyszUlITMz12Ayt9oFY3WzNSTLguITEBFws//27LWVjlu/VukxMTNizcSNHd+wAoLSXF4t/LPkT\nAJUnl1OH7zIgFUAI0QaYi75KZxzwvfFDU55WBdGp+2KvPqz0j+dGdAqXIpLYdCWRPj2Nt/Zt5YoV\ncXZxZfXFaMLj0/jrcjQx6YIGBoxxr12zFt+dusOt2FTO30lk46VIhgzMveBZ984vsPtGMmdvJxIS\nl8rPfrH06tajIN4OjerVIzkkhN5du/LOzJlYaTREREYWyLmVkuGxzT5CiHNSyvqZj78FIjJX8Xpg\nW6EEqZp9nho+sSk4Jh6ktYU5uObcXJITnU7H3K8W8uvq3zEzM2XyhDcZOeTRSx4+6Xm37txJcEgI\njerXp9Vzz2VtCw0P543Jk/C7dImqlSuzZMFXVKtaNWv76XPnOHriBG6urvTr0QNzc3NAPzO4Tfeu\nBFy5gokw4ZWXh7Dos88MimfLjh18PGc2iUlJ9O7egzkfzcw6b0EJDAqiflv9+I1pkyY9UIpZKdny\n1OYvhLgANJRSpgshAoExUsoDmdsuSilrP/JAI1DJ/+lwv2O3KEs05ERKyfCxr3H+9DGqu5hzMiyJ\nSRMmMWncuFyPXb1hA1M//IDmGltuxmfgoqnKtrXrMTMzy+qcbq6xJSZVxz2dNQe378TZyakQ3pXh\nJrz3Hj+uXAnAlRMnjDY7WSk8eU3+09Gv1XsP/VKLjaWUOiFENeAXKWVLYwX8iFhU8i/h7if+3JZZ\nLEo+J08yfNRwFjxfFgtTEyIS05m46xahF/2xyWFopJSS8jW9mNnSlSrOVuik5MNDEUyf8Rl9e/Sg\nddfOtLWLpnVl/TDMr09G0KrPCD54u/jNlbwZEkKNzF87Y195hUVz5hRxREp+5KnDV0r5KTAFWA60\nklLer9gqgIkFHqXy1MpepqG4Jn6AyOhoyjtaYWGq/7NwtTHDytyM2LicF6/T6XTEJSZR0VG/kpqJ\nEFSwNyMyKkp/3qgoKmVuA6hoZ8K9yHtGehf5U7lCBVJCQ3l34kSWrViBlUbDlatXizosxQhynOEr\npfSRUm6SUiZme+2yoWWchRAVhRD/CCEuCiEuCCHefMQ+7YQQsUII38x/Hz7521CKo5JWpqFx/fpc\nuZfEydAEUjJ0bAqIobSbG2Xc3HI8ztTUlJZNGvKbXxTJ6Tr8I5I4EZpAy8w76OfbtmNNQBwJaVpu\nxaay+2Yyz7dtVwjvKO/+9/77BJ/TT+Wp26YNQ19/nZJQBFIxnFGregohygJlpZRnhRB2wGmgj5Ty\nUrZ92gGTpZS9cjiPavYpYXxiU3C8sSnfzTwhYWH4nDyJk5MTHVq1Mnqp38PHj/PK62O4GxlNTU8P\n/vjl1weWMLxw6RIXAwOp6u5OkwYNsl6/e+8er74+hsOnTuPq5MjXn3+ZNcM3OTmZcVPeZuvOXVhZ\nWjL9nXcYP8p4pR8McTkoCN8LF6hQvjzeTZvmOFHtq6VLeX/2bACO79qVryqeSuEqNiWdhRCbgW+k\nlHuzvdYOmCKlfOyKESr5lyz3yzS01pTN14gen5Mn6Td8KLXcbLiTkEYlTy82r1pT4KNdsnvno+ls\n3LQBz1I2XLiTwBezP2HIAP1M2KXLf+aTz+dSq4wdgRGJjBw+gpnvvW+0WIzlj02bmPTBu9QtY8/V\nqGS6d+vBormf5/gFEBMbS9nMkhRtvb3Z8ccfBVouWjGOYpH8hRDuwAGgtpQyIdvrbYGNQAgQCrwj\npfR/6FiV/EuAgq6937B1S3qXT6dFRXu0OsmsIxGMn/whw420KPiZ8+fp+9IAFjxfDlsLU27FpvLe\nP2GEXvAnOSWFak0asaCjhjJ2FsSlZjDp7zD2b9v5wHDP4i49PZ1yNb34tF1Z3J2sSE7XMWVfOKuW\nr6R5kya5Hr989WrGvfMOAHs3baJls2bGDlnJh5ySvyHLOOZbZpPPeuCt7Ik/0xmgopQySQjRFdgM\nVH/4HPPnz8563KJFG7y9C62unGIA33hfLEOevExDTsLvRlCzfjkATE0Eno4mhGauVmUMoeHhuLvY\nYGuhb1qq6GiJpakpUTExxCck4GRjRRk7CwAcLM3QONkSevt2iUr+sfHxCCTuTlaAvvR0FRcbgz/X\nES+9xKA+fahYrx7Pv/gitb28OL5rV1aBO2OYOnMmlStWzKqS2uPll6mo0fDdF18A8N6sWWjKlePN\nMWOMFkNJceDoUQ76+Bi0r9GTf2ZdoA3ASinlf4qZSCnjsz3eIYRYIoRwkVJGZd9vypSPjB2qkge3\noyEqeDWWQOvKdQu0BHPThvXZejmIYXWduZeUwbHwVEY2alRg539Yvdq1CbybwJVIa6qVsuafG3HY\n2tlRxs0NFycn0qUJPrfiaVHRnot3k7gVk0TN6v+5TynWSjk7U8rFhb+vxdLJw5FrUSlcvJNg0Ezl\n+2ysrYm8coWNf/3Fy2PHYle5Mlt++40XOnQwSszezZqx4c8/mTB6NDqdjqjo6AfKahw/fZovZs0y\nyrVLmrbe3rT19s56/umCBY/d19gdvgJ9SYhIKeUjBzULIcoAd6WUUgjRDFgrpXR/aB/V7FMMPUmn\nbnBoKNdv3sSzShU05coZdP47ERH0Hz6EC4GXkRJmTp3K22+ML4jQH2vrzp2MmPAG2gwtLk6ObPl9\nDXUz27pPnT3LwFeHExefgIWFBSuWLqNT25L3C/TS5cv0HTaEu/fuYWJiytIFC+lnwCLtj5KWlkad\nVq0IDg2ljJsbl48fx9LSMvcDn0DY7du06dmToJMnuXDpEl8tW8adiAh+W7IEaysrKjdsSMj580b9\n9VFSFWWzT0tgKHBeCOGb+do0oBKAlHIZ0B8YJ4TIAJKAwUaOSSkA/mG7cTSwU/fH337lw9n/o5KL\nLcHRSSycM5eXsi1E/jhl3Nw4tGM3sXFx2FhbG7Wj976fVvyCNi2N0nYW3I6MZPP27VnJv0mDBlz1\nPUd0TAxOjo4ltsOzZvXq+B87QUxsLA729vkaQWVhYcHlEyfYc/AgPV56CUcPD35dsoSBvQuuzlL5\nsmUxMzPjVmgox06f5rnGjQm7fZvjp09jb29PbS8vlfjzQC3grjyR29EQbuaL650ggzp1Q8LCaNyu\nDfPal6OcvQXBsalM33+bgBMncXF2LoSIDbdlxw5GjhvD4u4euFibEXgvmen7ggn3D8DOzq6owyv2\ntFotbXr25PS5c5iamnLn0qUCW5Tl1YkT6d6xI7v++Ye3xowh7PZtfE6dwtHBgajoaGZ/8EGBXOdp\nk68F3BUlu+sWYXjEB9HAtaZB+98MCaGCsy3l7PUdpZUcLSllZ0lIWJgxw8yTk76+eLhY4WKtv4us\n4WqNuYkgUM1wNYipqSlHtm/nwNataLVaXKtX54dffy2Qc3s3aYLPqVNcDAigTs2aNGvUiGOnTnHs\n1ClaGDBKSfkvlfwVg/nG+1Ih/AgeFq4Gd+xWdXcnNCaJa9EpAATcSyY6OR33bBOniou23t4ERaYQ\nHq9feP1MWAJaCbVr1MjlSCW75xo3JjkkhM7t2zPxgw+w0miIio7O1zmbN2nC9j17cHF2RgiBs5MT\nsXFxHD992qAhqsp/qeT/DNJqtVy/HsSdO4YN78u+8EoD15pPNHGrbOnSfPvlAj4+dJeJf4fzmc89\nfvn2uwdWj0pJSSEwKIiY2Ngnfi8FqVO7dvTt3YeJ268zYnMQcw+HMW/WbKysrIo0LmMw9mcuhGDr\nypWc2rMHgPJ16vD5N9/k+Xy1vbyIio6mWbbRXnVq1sTRwaHYNR+WFKrN/xkTEXGHl4f24t69CJKT\nE+nevS9fzPv2sZ2XBVWmIS4+ntDwcCpqNA+0A584c4b+rwzDQuiISUrlsxkf89rw4Xm+Tn79sno1\nk6dPw9rcFEzNWLfiN7yzLZr+NCjsz1xKyehJk1i1fj0A18+coVyZMka7nvKvYjHDNz9U8i84o14b\njLlrGQa/NY3U5CTmvTGEMa+MYfDgV/+zb0GVaXgcnU6HR4P6jKhlTfMK9oTHpzH9wB3+3vIntYqg\nqRxHCcAAAA+OSURBVCXo2jVad32BT9uWReNgwamwBJaej+f62adnGGFRfubXbt6kVuYY9EljxzJ3\nxgyjXk9RHb5KNpcu+dG61wCEEFjZ2NKkYzcuXDz3wD5+kfpmHvukWP0yi0ZI/KAvoZyUnETzCvom\noHL2FtQqY4d/YKBRrpcb/8uXqVHaDo2DvnO6SXk7dBnp3ImIKJJ4jKEoP3OPypVJCQ3lzTFj+GrZ\nMqw0Gq4HB2dt33/kCJ379ycpOdnosSgq+T9zKlf24NzhfQBkpKdz0ecgHlU8s7b7xvtmrbZVEPV5\ncuLi5ISpqRmXIvSLksemZBAYkYiHu7tRr/s4HpUrE3QvkejkDAAC7yWTIcGtVKkiiccYisNn/vnM\nmVw7fRqAmi1aMHrSJOYtWsSLw4dz7NQpRr31liofXQhUs88z5saNqwwc3A17l1LEx0RT3bMGy39a\nR2SCOVHBq4GCL9OQk93//MMr48ZSydmGkOhEXh81mo+mvlco136UuQsX8PWy76jorJ+Q9sPXi7NK\nMxeV9PR0UtPSnnjMvJSSuPh4HOzt/9/e/QdZVd53HH9/wsKyu8KiASUsGENBEzpGQKOA4o+MZgQr\nTqJWbWzjr2bGNomxMR2rmcaOmoxNmKax1VoJGBuLilUTMrHSWlk07ADC8hvUFRiBDSI/XRcQkW//\nuGdlWXeBXbh7zr3385rZ4d6zz7n3O2eW733u8zzn+xxUsTNL1/zu++9n8kMPHXSssqKC+++6i1tv\nuimVmIqJx/ztIM3N77NsWT0VFZWcfvoo5u3eRHVDbWpbLG7avJlVb7xBzcCBnDps2OFPyLOGNWtY\n39jI54cPT31i8keTf8oDySqZMaNH8eTUx45o79/auXO5/pu38P6uXfTr05enpj120EqZLFzzFatX\nc/nXv87Wbdv4YO/eg35X0bs3L86YcVDM1nlO/tahfE/qWtf9+oUX+Ns7/4Z7zjuR6vIePLp4K5VD\nz+KJKb845Hlbt23j9HPHctuZxzNyYBXzNjQxZXkzq+YvoKqyspuiP7SGNWu48Ior2PHee+zbt6/d\nNp8+/njqZ8/mxP79uzm64uEJX/uE7prUta6bO38e5w8q54SKMnp8SlwxvC/zFr522PNWNzTwmb69\nGTkwN0x0zuA+VPWENevW5TniIzds6FAaFizg6SlTuGrSJKoqK+lz3HEHLTl+r6mJK2+4ocMPBzs6\nTv4lqG7nnm6b1LWuGzyohjff28/+5Nv56q17GDTw8MNQnznpJDbuaGbHnlzS3LLrQ7Y07eGkE0/M\na7yd1bt3byZecgm/evhh3lm1ihlTp3LjdddxfL9+VFVWEhHUL13K3913X9qhFiUP+5Sg+qZ6Rn3Q\nkPkN1Uvd7t27ufSqr9G0eQMnVJSxastuZj75NKO/+MXDnnvfT3/Co9Om8IUBlazY3Mz3vn073731\n1m6I+uhFBAuXLOGZmTN5+vnnady0iaemTOGKCRPSDq3geMzfgAMbr0D3ruixrtu7dy8vzZlDU3Mz\n48eM6dQE9GuLF/PGW28x4rTTOrVZS9asfvNNXn71VW649loqKirSDqegOPkbdc2Nqa7oMbPul/oe\nvpauzmy8YmalIa8TvpKGSHpZ0gpJyyV9p4N2P5f0pqQlkkblM6ZS4hU9ZtaRfPf8PwRuj4jFko4D\nFkr6n4hY1dJA0kRgWEQMl3QO8DAwJs9xFb36pnrKN6zm7JoKyvufl3Y4ZpYxee35R8SmiFicPH4f\nWAUMatNsErlN3omIeUC/ZFN364LWtffHf/b0zCf+bdu382e33MTw0SO56LIJLF2xIu2QzEpCt63z\nl3QKMAqY1+ZXNcD6Vs83AIO7J6riUtfcyLa3pzOipiw3zFMAq3mu/saf88HaRdz1pT6cVf4ul11z\nNZu3bEk7LLOi1y3JPxnyeQa4LfkG8IkmbZ5nfwlSxqxsnEV1Qy3jawYWzPr97Tt2sHjFCm4+4wQG\n9enFxUOr+aPjy5k7f37aoZkVvbyv9pHUE/gv4FcR8Xw7TTYCQ1o9H5wcO8jkyfd+/Hjs2PMZN+6C\nYxxpYWrZaatPVRkjz7gk7XA6pXd5Ofs+2k/T3o/o17uM/RFs3/0hVZ2sXmlmObVz5zKnru6I2uZ1\nnb9ydWR/CWyNiNs7aDMR+FZETJQ0BvhZRIxp08br/NtR31RP+ZrCntT9+x//iBlP/QfjB5Xz+o6P\n6PHpk3nx2efp2bNn2qGZFbzUbvKSdB4wB1jKgaGcu4CTASLikaTdvwCXAs3AjRGxqM3rOPm3Ukx3\n6kYEz/72t8x7bQEnDzmZW66/vig3TDdLg+/wLSKbtsPaXo2M2Pl736lrZofkks5F5OPE38s1zs2s\n61zeoYC0lGkY6jINZnaUnPwLwLKt0GPD9IJc0WNm2eTkn3Eu02Bm+eDkn1EtK3rKKfwVPWaWPU7+\nGVTX3Ej127WMqCkrmLt1zaywOPlnjGvvm1l3cPLPiJYyDTUDyhjqSV0zyzMn/wyob6qnep0ndc2s\n+zj5p6hlCacndc2suzn5p6SuuZHqDZ7UNbN0OPmnoG7nHqqbl3tS18xS49o+3WzZVhj87kzO7rUL\nygakHY6ZlSj3/LtR3c49VG94jr4Dyjyxa2apcvLvJl7RY2ZZ4uSfZy7TYGZZ5OSfRy7TYGZZldcJ\nX0lTJb0jaVkHv79Q0k5J9cnPD/IZT3da2TiL6oZaxtcMdOI3s8zJd89/GvAg8Pgh2tRGxKQ8x9Ft\nXKbBzApBXpN/RLwi6ZTDNGt3f8lC5EldMysUaY/5BzBO0hJgI3BHRKxMOaZOc5kGMys0aSf/RcCQ\niNglaQLwPHBqew0nT77348djx57PuHEXdE+Eh+EyDWaWFbVz5zKnru6I2ioi8hpMMuwzMyIOW8dA\n0lrgzIjY1uZ4bNz4QX4C7KJN22Hb7lng2vtmllG9a2qIiHaH1lPt+Us6CdgcESHpbHIfRtsOd17a\nNm2HP5TVM5SdntQ1s4KU1+QvaTpwAdBf0nrgh0BPgIh4BLgKuFXSPmAXcG0+4zlW1vZqZMTOBob2\n6p92KGZmXZL3YZ9jIUvDPvVN9fR/p4GR/fp7qMfMMi2zwz6FpPWKnpFe0WNmBc7J/wh4RY+ZFRsn\n/8NY2TiLaq/oMbMi4+TfAZdpMLNi5uTfDpdpMLNi5+TfSsukbv+qMk/qmllRc/JPeFLXzEpJySf/\nljINntQ1s1JS0sm/buceqt/2pK6ZlZ6STf51zY1Ur6v1pK6ZlaSSTP51O/dQ3bzcid/MSlZe9/DN\nomVboXrdc4zotdOJ38xKVkn1/L2ix8wspySSv1f0mJkdrOiTv1f0mJl9UlEnf5dpMDNrX7538poK\nXEZuq8Z2x1ok/RyYQG4nrxsiov5o39dlGszMDi3fq32mAZd29EtJE4FhETEc+Cbw8NG+YV1zIz02\nTGdETRkjh12UqcRfO3du2iFkmq9Px3xtDs3Xp/Pymvwj4hVg+yGaTAJ+mbSdB/RLNnXvtE3bk9r7\nDbWMrxmYydU8c+rq0g4h03x9OuZrc2i+Pp2X9ph/DbC+1fMNwGDgnc68iCd1zcw6J+3kD9B2c+FO\n7SjvMg1mZp2niE7l2s6/gXQKMLO9CV9J/wbMjognk+ergQsi4p027fIbpJlZkYqIth1sIP2e/2+A\nbwFPShoD7Gib+KHj4M3MrGvyvdRzOnAB0F/SeuCHQE+AiHgkIn4naaKkBqAZuDGf8ZiZWU7eh33M\nzCx7Sq6qZ1ok9ZBUL2lm2rFkjaR1kpYm12d+2vFkjaR+kp6RtErSymSItORJOi35m2n52SnpO2nH\nVSjSHvMvJbcBK4E+aQeSQQFcGBHb0g4ko/4Z+F1EXCWpDKhKO6AsiIjXgVEAkj4FbASeSzWoAuKe\nfzeQNBiYCEzhk0tbLcfXpR2SqoHxETEVICL2RcTOlMPKoouBtyJi/WFbGuDk313+Cfg+sD/tQDIq\ngP+V9Jqkv0w7mIz5HPCupGmSFkl6VFJl2kFl0LXAf6YdRCFx8s8zSX9CrrBdPe7dduTciBhFrsDf\nX0san3ZAGVIGjAYeiojR5FbF3ZluSNkiqRdwOTAj7VgKiZN//o0DJklaC0wHvizp8ZRjypSI+EPy\n77vkxmzPTjeiTNkAbIiIBcnzZ8h9GNgBE4CFyd+PHSEn/zyLiLsiYkhEfI7cV9P/i4i/SDuurJBU\nKalP8rgK+AqwLN2osiMiNgHrJZ2aHLoYWJFiSFl0HbmOlXWCV/t0P99YcbCTgOckQe7v8YmImJVu\nSJnzbeCJZHjjLXwz5MeSDsPFgOeKOsk3eZmZlSAP+5iZlSAnfzOzEuTkb2ZWgpz8zcxKkJO/mVkJ\ncvI3MytBTv5WFCTdLWm5pCVJed9jepewpAvbK8fd0fFj8H5XSPpCq+ezJZ15rN/HSpdv8rKCJ2ks\ncBkwKiI+lHQCUJ5yWEfrq8BMYFXy3Dfk2DHlnr8Vg4HAloj4ECAitrXUC5J0ZtJrfk3Sf0samByf\nLelnybeEZZK+lBw/W9LcpILm71uVVTgsSVWSpkqal5w/KTl+g6RnJb0g6Q1JD7Q652ZJryfn/Luk\nB5MPs8uBnySvMzRpfnXS7nVJ5x2LC2ely8nfisEsYEiSFP9V0vkAknoCDwJXRsRZwDTg/uScACqS\naqJ/BUxNjq8iVz9/NLk9p3/UiTjuBl6KiHOAL5NL3i3ll88A/hQ4HbhGUo2kQcAPgHOAc4HPAxER\ndcBvgDsiYnRErEleo0fy2t9NYjPrMg/7WMGLiOZkPHw8cBHwlKQ7gYXAH5PbKwCgB9DY6tTpyfmv\nSOorqS9QDTwuaRi5D4ienQjlK8Dlku5InpcDJyev81JENAFIWgmcAgwAaiNiR3J8BtD6m0bbEuDP\nJv8uSs436zInfysKEbEfqAVqJS0DvkEu+a+IiHGdeKl7ySXqr0r6LDC7k6F8LSLebH1A0jnAB60O\nfUTu/17bcfy2yb7t71teo+V8sy7zsI8VPEmnShre6tAoYB3wOjCgZcNzST0ljWjV7prk+HnAjoh4\nD+jLgW8Hna2e+SLw8Qbikka1PGynbQALgAuSDdrLgCs5kPCbkljM8sLJ34rBccBjklZIWkJu7Pye\nZAL4KuABSYuBemBsq/P2SFoEPATcnBz7R+DHyfEeHNz7bm/FTbQ6fi/QU9JSScuBf2inzYETIxrJ\nzSnMB14F1gIt+/M+CXxf0sJWE75t39esy1zS2UqSpJeB70XEopTjqErmLMrIjen/IiJ+nWZMVhrc\n8zdL1z2S6sntXrbGid+6i3v+ZmYlyD1/M7MS5ORvZlaCnPzNzEqQk7+ZWQly8jczK0FO/mZmJej/\nAc3lc4xh6GKWAAAAAElFTkSuQmCC\n",
"text": [
"<matplotlib.figure.Figure at 0x10c8a16d8>"
]
}
],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"The perceptron update rule"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But how do we learn the weight vector? Perhaps the simplest method is to initialize all weights $w_i \\in w$ to zero, and to then apply the _perceptron update rule_, as follows. We iterate over observations, classifying each example. \n",
"\n",
"The _loss function_ for the perceptron update rule is defined as\n",
"\n",
"$$\\mathcal L(y, \\hat{y}) = \\begin{cases} 0 & \\text{if } y = \\hat{y} \\\\ 1 & \\text{otherwise}\\end{cases}~.$$\n",
"\n",
"That is, loss is 0 if the label and prediction match, and 1 otherwise. (For this reason, this loss function is sometimes called _0-1 loss_.) The _update_ $\\tau$ is given by\n",
"\n",
"$$\\tau = \\mathcal L(y, \\hat{y}) \\operatorname{sgn}(y - \\hat{y})~.$$\n",
"\n",
"In prose, then, the update is defined as\n",
"\n",
"* $\\tau = +1$ if $y = +1$ but $\\hat{y} = -1$\n",
"* $\\tau = -1$ if $y = -1$ but $\\hat{y} = +1$\n",
"* $\\tau = 0$ otherwise\n",
"\n",
"Of course, in the last case (i.e., if the observation is correctly classified) we do not bother to compute $\\tau$. To apply the update, define the weight vector $w_t$ at time $t$ according to \n",
"\n",
"$$w_t = w_{t - 1} + \\alpha~\\tau~\\phi$$\n",
"\n",
"where $\\alpha$ is _learning rate_, a positive real number. Without loss of generality, we can assume that $\\alpha = 1$ and selectively omit it henceforth. In practice, it may be set to a smaller value or gradually tapered off as part of a _learning schedule_."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Generalizations for NLP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In NLP, we generally adopt several simplifying assumptions. First, we assume that all features $\\phi_i$ are binary-valued (i.e., $\\{0, 1\\}$). Secondly, we assume that the vast majority of the features for any observation are zero (i.e., feature values are sparse). We thus conceive of $\\phi$ as a list of values which are \"activated\" for this observation.\n",
"Third, we assume that most weights have a true zero value.\n",
"\n",
"With these assumptions in place we can rewrite the decision rule and update rules as\n",
"\n",
"$$\\begin{align}\n",
"\\hat{y} &= \\begin{cases} +1 & \\text{if }\\displaystyle\\sum_{i~\\in~\\phi} w_i > 0 \\\\ -1 & \\text{otherwise}\\end{cases} \\\\\n",
" w_{i,t} &= w_{i, t - 1} + \\alpha~\\tau\n",
"\\end{align}$$\n",
"\n",
"where $i~\\in~\\phi$ means that $\\phi_i$ is \"activated\" for this observation, and $w_{i,t}$ is the $i$-th weight at time $t$. Now, let's put it all together."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# an abstract base class\n",
"\n",
"from random import Random\n",
"from collections import defaultdict\n",
"\n",
"class SGDLinearBinaryClassifier(object):\n",
" \"\"\"\n",
" Abstract base class for stochastic gradient descent-based\n",
" linear classification on sparse, hashable binary features\n",
" \"\"\"\n",
" \n",
" def __init__(self, seed=None, w_constructor=int):\n",
" self.random = Random(seed)\n",
" self.weights = defaultdict(w_constructor)\n",
" \n",
" def score(self, phi):\n",
" return sum(self.weights[phi_i] for phi_i in phi)\n",
" \n",
" # NB: prediction is now boolean rather than {-1, +1}\n",
" def predict(self, phi):\n",
" return self.score(phi) > 0\n",
" \n",
" # to be explained\n",
" \n",
" def fit(self, Y, Phi, epochs, alpha=1):\n",
" # we make `epochs` passes through the data, shuffling the \n",
" # order of presentation each time\n",
" data = list(zip(Y, Phi)) # which is a copy\n",
" for _ in range(epochs):\n",
" self.random.shuffle(data)\n",
" for (y, phi) in data:\n",
" self.fit_one(y, phi, alpha)\n",
" \n",
" def fit_one(self, y, phi, alpha=1):\n",
" raise NotImplementedError"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class BinaryPerceptronClassifier(SGDLinearBinaryClassifier):\n",
" \"\"\"\n",
" Binary linear classifier using the perceptron update rule\n",
" and stochastic gradient descent\n",
" \"\"\"\n",
" \n",
" def update(self, phi, tau, alpha=1):\n",
" \"\"\"\n",
" Generic update function, where `tau` is the penalty to be \n",
" applied and `alpha` is the learning rate\n",
" \"\"\"\n",
" for phi_i in phi:\n",
" self.weights[phi_i] += alpha * tau\n",
" \n",
" def fit_one(self, y, phi, alpha=1):\n",
" yhat = self.predict(phi)\n",
" if yhat and not y: # false positive\n",
" self.update(phi, -1, alpha)\n",
" elif y and not yhat: # false negative\n",
" self.update(phi, +1, alpha)\n",
" # else: loss, and tau, are 0"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Application: sentence segmentation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider the following sentence from the Wall St. Journal:\n",
"\n",
"_Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990._\n",
"\n",
"This sentence contains 4 periods, but only the last denotes a sentence boundary. It\u2019s obvious that the first one in _U.S._ is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like _$12.53_. But the periods at the end of _Inc._ and _U.S._ could easily have been on the left edge of a sentence boundary; it just turns out they\u2019re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb _expect_ selects two arguments (an object _its U.S. sales_ and the infinitival clause _to remain steady..._), neither of which would be satisfied if _U.S._ was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (strictu sensu):\n",
"\n",
"_He says the big questions&mdash;\"Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?\"&mdash;aren\u2019t asked of companies coming to market._\n",
"\n",
"Much of the data used for natural language processing experiment\u2014including the enormous [Gigaword](https://catalog.ldc.upenn.edu/LDC2011T07) corpus\u2014does not include annotations for sentence boundaries providence annotations for sentence boundaries. The task of inserting appropriate sentence boundaries is known under many names, including _sentence boundary detection_ (SBD), _sentence boundary disambiguation_, _sentence segmentation_, _sentence splitting_, or _sentence tokenization_. SBD is one of the earliest steps in many natural language processing (NLP) pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I recently created my own SBD system. It slightly outperforms the current best-known SBD system, [Splitta](https://github.com/lukeorland/splitta), despite using fewer features and a much simpler learning algorithm. Candidate sentence boundaries are identified using the following nasty regular expression:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"candidate = r\"/(\\S+)\\s*((\\.+)|([!?]))(['`\\\")}\\]]*)(\\s+)\\s*(\\S+)/\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first regexp group matches the left token _L_, and the last group matches the right token _R_. If the _L_ or _R_ tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token `*NUMBER*`; a similar approach is used to convert various types of quotation marks into `*QUOTE*`. The following features are then extracted:\n",
"\n",
"* the identity of the punctuation mark\n",
"* identity of _L_ and of _R_\n",
"* the _joint_ identity of both _L_ and _R_\n",
"* casing of _L_ and of _R_\n",
"* does _L_ contain a vowel?\n",
"* does _L_ contain a period?\n",
"* length of _L_\n",
"\n",
"Now, let's put that all together."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"corpus = \"\"\"\n",
"Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.\n",
"He says the big questions--``Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?''--aren\u2019t asked of companies coming to market.\n",
"\"\"\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from re import finditer, match, search, sub\n",
"from collections import namedtuple\n",
"from string import ascii_lowercase, ascii_uppercase, digits\n",
"\n",
"from nltk import word_tokenize\n",
"\n",
"BUFSIZE = 32 # for reading in left and right contexts...see below\n",
"\n",
"# character classes\n",
"\n",
"DIGITS = frozenset(digits)\n",
"LOWERCASE = frozenset(ascii_lowercase)\n",
"UPPERCASE = frozenset(ascii_uppercase)\n",
"LETTERS = LOWERCASE | UPPERCASE\n",
"VOWELS = frozenset(\"AEIOUY\")\n",
"\n",
"# token classes\n",
"\n",
"QUOTE_TOKEN = \"*QUOTE*\"\n",
"NUMBER_TOKEN = \"*NUMBER*\"\n",
"\n",
"# regexes\n",
"\n",
"PUNCT = r\"((\\.+)|([!?]))\"\n",
"TARGET = PUNCT + r\"(['`\\\")}\\]]*)(\\s+)\"\n",
"\n",
"LTOKEN = r\"(\\S+)\\s*$\"\n",
"RTOKEN = r\"^\\s*(\\S+)\"\n",
"NEWLINE = r\"^\\s*[\\r\\n]+\\s*$\"\n",
"\n",
"NUMBER = r\"^(\\-?\\$?)(\\d+(\\,\\d{3})*([\\-\\/\\:\\.]\\d+)?)\\%?$\"\n",
"QUOTE = r\"^['`\\\"]+$\"\n",
"\n",
"Observation = namedtuple(\"Observation\", [\"L\", \"P\", \"R\", \"B\", \"end\"])\n",
"\n",
"\n",
"def candidates(text):\n",
" \"\"\"\n",
" Given a `text` string, get candidates and context for feature\n",
" extraction and classification\n",
" \"\"\"\n",
" for Pmatch in finditer(TARGET, text):\n",
" # the punctuation mark itself\n",
" P = Pmatch.group(1)\n",
" # is it a boundary?\n",
" B = bool(match(NEWLINE, Pmatch.group(5)))\n",
" # L & R\n",
" start = Pmatch.start()\n",
" end = Pmatch.end()\n",
" Lmatch = search(LTOKEN, text[max(0, start - BUFSIZE):start])\n",
" if not Lmatch: # this happens when a line begins with '.'\n",
" continue\n",
" L = word_tokenize(\" \" + Lmatch.group(1))[-1]\n",
" Rmatch = search(RTOKEN, text[end:end + BUFSIZE])\n",
" if not Rmatch: # this happens at the end of the file, usually\n",
" continue\n",
" R = word_tokenize(Rmatch.group(1) + \" \")[0]\n",
" # complete observation\n",
" yield Observation(L, P, R, B, end)\n",
" \n",
"for candidate in candidates(corpus):\n",
" print(candidate)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Observation(L='Inc', P='.', R='said', B=False, end=29)\n",
"Observation(L='U.S', P='.', R='sales', B=False, end=54)\n",
"Observation(L='1990', P='.', R='He', B=True, end=106)\n",
"Observation(L='investments', P='?', R='Have', B=False, end=199)\n",
"Observation(L='sector', P='?', R='What', B=False, end=257)\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feature extraction code is always hairy, and this is no exception."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def fit_case(token):\n",
" \"\"\"\n",
" `True` iff the token starts with an uppercase character,\n",
" `False`` if it starts with a lowercase character, and \n",
" `None` if it does not start with an alphabetic character.\n",
" \"\"\"\n",
" if not token:\n",
" return\n",
" if token[0] in UPPERCASE:\n",
" return True\n",
" if token[0] in LOWERCASE:\n",
" return False\n",
"\n",
"\n",
"def extract_features(L, P, R):\n",
" \"\"\"\n",
" Given left context `L`, punctuation mark `P`, and right context \n",
" `R`, extract features.\n",
" \"\"\"\n",
" yield \"*bias*\"\n",
" # L feature(s)\n",
" if match(QUOTE, L):\n",
" L = QUOTE_TOKEN\n",
" elif match(NUMBER, L):\n",
" L = NUMBER_TOKEN\n",
" else:\n",
" yield \"len(L)={}\".format(min(10, len(L)))\n",
" if \".\" in L:\n",
" yield \"(L:period)\"\n",
" if fit_case(L):\n",
" yield \"(L_0:upper)\"\n",
" L = L.upper()\n",
" if not any(char in VOWELS for char in L):\n",
" yield \"(L:no-vowel)\"\n",
" L_feat = \"L='{}'\".format(L)\n",
" yield L_feat\n",
" # P feature(s)\n",
" yield \"P='{}'\".format(P)\n",
" # R feature(s)\n",
" if match(QUOTE, R):\n",
" R = QUOTE_TOKEN\n",
" elif match(NUMBER, R):\n",
" R = NUMBER_TOKEN\n",
" else:\n",
" if fit_case(R):\n",
" yield \"(R_0:upper)\"\n",
" R = R.upper()\n",
" R_feat = \"R='{}'\".format(R)\n",
" yield R_feat\n",
" # the combined L,R feature\n",
" yield \"{},{}\".format(L_feat, R_feat)\n",
" \n",
" \n",
"(L, P, R, _, _) = next(candidates(corpus))\n",
"phi = extract_features(L, P, R)\n",
"for phi_i in phi:\n",
" print(phi_i)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"*bias*\n",
"len(L)=3\n",
"L='INC'\n",
"P='.'\n",
"R='SAID'\n",
"L='INC',R='SAID'\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This system, dubbed **Detector Morse**, performs exceptionally well on the \u201cstandard split\u201d, with an accuracy of .9955, an _F_-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta modified to handle ellipses, question marks, etc.; this forked system produced 55 errors."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Averaging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One weakness of the \"vanilla\" perceptron (described above) is that it lacks _stability_; a very last example may greatly alter the weight vector, resulting in poor generalizability. One strategy to address this is to use the _pocket_ trick, i.e., store a copy of the best $w$ so far (\"in the pocket\"), and use the pocket weights for inference once training is complete. Freund & Schapire propose another method, which they refer to as _voting_. Rather than modifying the weight vector $w$ in place, they keep a copy of every weight vector $w_0, w_1,...w_t$. Once training is complete, each weight vector is allowed to \"vote\" on the prediction, as in\n",
"\n",
"$$\\hat{y} = \\operatorname{sgn}\\left(\\displaystyle\\sum_{i=0}^t~\\operatorname{sgn}(w_i~\\cdot~\\phi)\\right)~.$$\n",
"\n",
"Intuitively, this results in greater stability. However, it has a much higher memory complexity than the vanilla perceptron&mdash;a na\u00efve implementation has much greater memory requirements&mdash;as well as greater time complexity at inference time. Freund & Schapire therefore suggest an alternative to voting, naming _averaging_ of the weights at inference time, as in\n",
"\n",
"$$\\hat{y} = \\operatorname{sgn}(\\bar{w}~\\cdot~\\phi)~.$$\n",
"\n",
"where $\\bar{w}$ represents the averaged weight vector. The averaged perceptron has the same space and time complexity as the vanilla perceptron, but Freund & Schapire find that averaging performs just as well as voting. As a result, averaging is considered a \"best practice\" for most applications of linear classification.\n",
"\n",
"There are two ways to implement averaging. First, the classifier can preserve two weight vectors: $w_t$, the current weights&mdash;and $\\sum_i^t w$, the itemwise sum of all weights so far. However the latter term may overflow when when using fixed sized integers to represent weights. (Fortunately, this is not a problem for us in Python, which uses arbitrary precision integers!) One alternative is to employ an stable online averaging algorithm:\n",
"\n",
"$$\\bar{w}_t = \\bar{w}_{t - 1} + \\frac{w_t - \\bar{w}_{t - 1}}{t}$$"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"More information"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting system, called [DetectorMorse](https://github.com/cslu-nlp/DetectorMorse/), is available on GitHub with a BSD-like license. The method is described in more detail in a [blog post](http://sonny.cslu.ohsu.edu/~gormanky/blog/simpler-sentence-boundary-detection/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Super-cool list of linear model jargon to learn about later:\n",
" \n",
"* the _hash trick_\n",
"* non-linear updates with _the winnow_\n",
"* adaptations for _multiclass classification_\n",
"* _uniform updates_ and _proportional updates_\n",
"* _ranking_ and _reranking_ with linear classifiers\n",
"* non-linear classification with _the kernel trick_\n",
"* _maximum-margin classifiers_ such as _support vector machines_ (SVMs)\n",
"\n",
"For even more detail, see my notebook [\"Linear classifiers for NLP\"](http://nbviewer.ipython.org/url/cslu.ohsu.edu/~gormanky/courses/CS662/Notebooks/CS-562-662-linear-classifiers-for-NLP.ipynb), which was prepared for my [CS 562/662 (Natural Language Processing)](http://cslu.ohsu.edu/~gormanky/courses/CS662/) class."
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment