AndreiDuma · January 8, 2016 11:52
diff --git a/Lab11-NLP-Skel (1).ipynb b/Lab11-NLP-Skel (1).ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prelucrarea Limbajului Natural: Analiza Sentimentelor\n",
    " - Tudor Berariu\n",
    " - Andrei Olaru\n",
    "\n",
    "Scopul acestui laborator îl reprezintă rezolvarea unei probleme ce implică analiza unor documente în limbaj natural și învățarea unui algoritm simplu de clasificare: **Naive Bayes**.\n",
    "\n",
    "## Analiza Sentimentelor\n",
    "\n",
    "O serie de probleme de inteligență artificială presupun asocierea unei clase unui document în limbaj natural. Exemple de astfel de probleme sunt: **clasificarea** email-urilor în *spam* sau *ham* sau a recenziilor unor filme în *pozitive* sau *negative*. În laboratorul de astăzi vom aborda problema din urmă.\n",
    "\n",
    "Folosind setul de date de aici: http://www.cs.cornell.edu/people/pabo/movie-review-data/ (2000 de recenzii de film), vom construi un model care să discrimineze între recenziile pozitive și recenziile negative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Algoritmul Naive Bayes\n",
    "\n",
    "### Clasificare\n",
    "\n",
    "Având un set de date $\\langle \\mathbf{X}, \\mathbf{T} \\rangle$ compus din $N$ exemple $\\mathbf{x}^{(i)}$, $1 \\le i \\le N$, descrise prin $k$ atribute $(x^{(i)}_1, x^{(i)}_2, \\ldots, x^{(i)}_k)$ și etichetate cu o clasă $t^{(i)} \\in \\mathcal{C}$, se cere construirea unui clasificator care să eticheteze exemple noi.\n",
    "\n",
    "### Naive Bayes\n",
    "\n",
    "**Naive Bayes** reprezintă o *metodă statistică inductivă* de clasificare, bazată pe Teorema lui Bayes pentru exprimarea relației dintre probabilitatea *a priori* și probabilitatea *posterioară* ale unei ipoteze.\n",
    "\n",
    "$$P(c \\vert \\mathbf{x}) = \\frac{P(\\mathbf{x} \\vert c) \\cdot P(c)}{P(\\mathbf{x})}$$\n",
    "\n",
    " - $P(c)$ reprezintă probabilitatea *a priori* a clasei $c$\n",
    " - $P(c \\vert \\mathbf{x})$ reprezintă probabilitatea *a posteriori* (după observarea lui $\\mathbf{x}$)\n",
    " - $P(\\mathbf{x} \\vert c)$ reprezitnă probabilitatea ca $\\mathbf{x}$ să aparțină clasei $c$ (*verosimilitatea*)\n",
    " \n",
    "Un clasificator **Naive Bayes** funcționează pe principiul verosimilității maxime (eng. *maximum likelihood*), deci alege clasa $c$ pentru care probabilitatea $P(c \\vert x)$ este maximă:\n",
    "\n",
    "$$c_{MAP} = \\underset{c \\in \\mathcal{C}}{\\arg\\max} P(c \\vert \\mathbf{x}) = \\underset{c \\in \\mathcal{C}}{\\arg\\max} \\frac{P(\\mathbf{x} \\vert c) \\cdot P(c)}{P(x)} = \\underset{c \\in \\mathcal{C}}{\\arg\\max} P(\\mathbf{x} \\vert c) \\cdot P(c)$$\n",
    "\n",
    "Cum fiecare exemplu $\\mathbf{x}$ este descris prin $K$ atribute:\n",
    "\n",
    "$$c_{MAP} = \\underset{c \\in \\mathcal{C}}{\\arg\\max} P(x_1, x_2, \\ldots x_K \\vert c) \\cdot P(c)$$\n",
    "\n",
    "Algoritmul **Naive Bayes** face o presupunere simplificatoare, și anume, că atributele unui exemplu sunt *condițional independente* odată ce clasa este cunoscută:\n",
    "\n",
    "$$P(\\mathbf{x} \\vert c) = \\displaystyle\\prod_i P(x_i \\vert c)$$\n",
    "\n",
    "Astfel clasa pe care o prezice un clasificator **Naive Bayes** este:\n",
    "\n",
    "$$c_{NB} = \\underset{c \\in \\mathcal{C}}{\\arg\\max} P(c) \\cdot \\displaystyle \\prod_{i}^{K} P(x_i \\vert c)$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clasificarea documentelor\n",
    "\n",
    "Pentru clasificare documentele vor fi reprezentate prin vectori binari de lungimea vocabularului (eng. *bag of words*). Practic fiecare document va avea 1 pe pozițiile corspunzătoare cuvintelor pe care le conține și 0 pe toate celelalte poziții. Dimensiunea unui exemplu $\\mathbf{x}$ este, deci, numărul de cuvinte diferite din setul de date.\n",
    "\n",
    "### Estimarea parametrilor modelului Naive Bayes\n",
    "\n",
    "Probabilitatea _a priori_ pentru o clasă $c \\in \\mathcal{C}$:\n",
    "\n",
    "$$P(c) = \\frac{\\#\\text{ docs in class }c}{\\#\\text{ total docs}}$$\n",
    "\n",
    "$P(x_i \\vert c)$ va reprezenta probabilitatea de a apărea cuvântul $x_i$ într-un document din clasa $c$ și o vom estima cu raportul dintre numărul de apariții ale cuvântului $x_i$ în documentele din clasa $c$ și numărul total de cuvinte ale acelor documente:\n",
    "\n",
    "$$P(x_i \\vert c) = \\frac{\\#\\text{ aparitii ale lui } x_i \\text{ in documente din clasa } c}{\\#\\text{ numar total de cuvinte in documentele din clasa } c}$$\n",
    "\n",
    "Deoarece este posibil ca un cuvant _rar_ ce apare într-un exemplu de test să nu se găsească deloc într-una din clase, se poate întâmpla ca un astfel de _accident_ să anuleze complet o probabilitate. Dacă un singur factor al unui produs este zero, atunci produsul devine zero. De aceea vom folosi netezire Laplace (eng. _Laplace smoothing_):\n",
    "\n",
    "$$P(x_i \\vert c) = \\frac{\\#\\text{ aparitii ale lui } x_i \\text{ in documente din clasa } c + \\alpha}{\\#\\text{ numar total de cuvinte in documentele din clasa } c + \\vert Voc \\vert \\cdot \\alpha}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setul de date\n",
    "\n",
    " 1. Descărcați setul de date **polarity dataset v2.0** de aici http://www.cs.cornell.edu/people/pabo/movie-review-data/\n",
    " 2. Dezarhivați fișierul **review_polarity.tar.gz** și rearhivați directorul review_polarity ca zip.\n",
    " 3. Plasați / încărcați **review_polarity.zip** în directorul de lucru."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Recenzii pozitive: 1000; Recenzii negative: 1000\n"
     ]
    }
   ],
   "source": [
    "import zipfile\n",
    "\n",
    "zipi = zipfile.ZipFile(\"review_polarity.zip\")\n",
    "\n",
    "pos_files = [f for f in zipi.namelist() if '/pos/cv' in f]\n",
    "neg_files = [f for f in zipi.namelist() if '/neg/cv' in f]\n",
    "\n",
    "pos_files.sort()\n",
    "neg_files.sort()\n",
    "\n",
    "print(\"Recenzii pozitive: \" + str(len(pos_files)) + \"; Recenzii negative: \" + str(len(neg_files)))\n",
    "\n",
    "# Raspunsul asteptat: \"Recenzii pozitive: 1000; Recenzii negative: 1000\"\n",
    "assert(len(pos_files) == 1000 and len(neg_files) == 1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setul de antrenare și setul de testare\n",
    "\n",
    "Vom folosi 80% din datele din fiecare clasă pentru antrenare și 20% pentru testare."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tr_pos_no = int(.8 * len(pos_files))\n",
    "tr_neg_no = int(.8 * len(neg_files))\n",
    "\n",
    "from random import shuffle\n",
    "shuffle(pos_files)\n",
    "shuffle(neg_files)\n",
    "\n",
    "pos_train = pos_files[:tr_pos_no] # Recenzii pozitive pentru antrenare\n",
    "pos_test  = pos_files[tr_pos_no:] # Recenzii pozitive pentru testare\n",
    "neg_train = neg_files[:tr_neg_no] # Recenzii negative pentru antrenare\n",
    "neg_test  = neg_files[tr_neg_no:] # Recenzii negative pentru testare"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Construirea vocabularului și calculul parametrilor\n",
    "\n",
    "Funcția `parse_document` primește calea către unul dinte fișierele aflate în arhivă și întoarce cuvintele din acest fișier (exceptând cuvintele cu o singură literă și pe cele din lista `STOP_WORDS`. Implementați funcția `count_words` astfel încât să întoarcă un dicționar cu o intrare pentru fiecare cuvânt care să conțină un tuplu cu două valori: numărul de apariții ale acelui cuvânt în rencezii pozitive și numărul de apariții în recenzii negative. În afara acelui dicționar se vor întoarce și numărul total de cuvinte din recenziile pozitive și numărul total de cuvinte din recenziile negative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "FileNotFoundError",
     "evalue": "[Errno 2] No such file or directory: 'stop_words'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mFileNotFoundError\u001b[0m                         Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-46-624fe936edd0>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[0mSTOP_WORDS\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mSTOP_WORDS\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[0mline\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstrip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mline\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"stop_words\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      3\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      4\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mre\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'stop_words'"
     ]
    }
   ],
   "source": [
    "STOP_WORDS = []\n",
    "STOP_WORDS = [line.strip() for line in open(\"stop_words\")]\n",
    "\n",
    "import re\n",
    "\n",
    "def parse_document(path):\n",
    "    for word in re.findall(r\"[-\\w']+\", zipi.read(path).decode(\"utf-8\")):\n",
    "        if len(word) > 1 and word not in STOP_WORDS:\n",
    "            yield word\n",
    "\n",
    "def count_words():\n",
    "    vocabulary = {}\n",
    "    pos_words_no = 0\n",
    "    neg_words_no = 0\n",
    "    \n",
    "    # ------------------------------------------------------\n",
    "    # <TODO 1> numrati aparitiile in documente pozitive si\n",
    "    # in documente negative ale fiecarui cuvant, precum si numarul total\n",
    "    # de cuvinte din fiecare tip de recenzie\n",
    "    \n",
    "    # ------------------------------------------------------\n",
    "    for w in [w for f in pos_train for w in parse_document(f)]:\n",
    "        pos_words_no += 1\n",
    "        p, n = vocabulary.get(w, (0, 0))\n",
    "        vocabulary[w] = (p + 1, n)\n",
    "    for w in [w for f in neg_train for w in parse_document(f)]:\n",
    "        neg_words_no += 1\n",
    "        p, n = vocabulary.get(w, (0, 0))\n",
    "        vocabulary[w] = (p, n + 1)\n",
    "    \n",
    "    return (vocabulary, pos_words_no, neg_words_no)\n",
    "\n",
    "# -- VERIFICARE --\n",
    "(voc, p_no, n_no) = count_words()\n",
    "print(\"Vocabularul are \", len(voc), \" cuvinte.\")\n",
    "print(p_no, \" cuvinte in recenziile pozitive si \", n_no, \" cuvinte in recenziile negative\")\n",
    "print(\"Cuvantul 'beautiful' are \", voc.get(\"beautiful\", (0, 0)), \" aparitii.\")\n",
    "print(\"Cuvantul 'awful' are \", voc.get(\"awful\", (0, 0)), \" aparitii.\")\n",
    "\n",
    "# Daca se comentează liniile care reordonează aleator listele cu exemplele pozitive și negative,\n",
    "# rezultatul așteptat este:\n",
    "#\n",
    "# Vocabularul are  44895  cuvinte.\n",
    "# 526267  cuvinte in recenziile pozitive si  469812  cuvinte in recenziile negative\n",
    "# Cuvantul 'beautiful' are  (165, 75)  aparitii.\n",
    "# Cuvantul 'awful' are  (16, 89)  aparitii."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predicția sentimentului unei recenzii noi\n",
    "\n",
    "Implementați funcția `predict` care primește parametrii `params` (vocabularul, numărul total de cuvinte din recenziile pozitive și numărul total de cuvinte din recenziile negative) și `path` (calea către o recenzie din cadrul arhivei) și întoarce clasa mai probabilă și logaritmul acelei probabilități. Al treilea argument (opțional) al funcției `predict` este coeficientul pentru netezire Laplace.\n",
    "\n",
    "Așa cum a fost explicat anterior, clasa pe care o prezice un clasificator **Naive Bayes** este dată de următoarea expresie:\n",
    "\n",
    "$$c_{NB} = \\underset{c \\in \\mathcal{C}}{\\arg\\max} P(c) \\cdot \\displaystyle \\prod_{i}^{K} P(x_i \\vert c)$$\n",
    "\n",
    "Pentru a evita lucrul cu numere foarte mici ce pot rezulta din produsul multor valori subunitare, vom logaritma expresiile date:\n",
    "\n",
    "$$c_{NB} = \\underset{c \\in \\mathcal{C}}{\\arg\\max} \\log(P(c)) + \\displaystyle\\sum_{i}^{K} \\log(P(x_i \\vert c))$$\n",
    "\n",
    "Pentru calculul probabilitatilor, vedeti sectiunea \"Estimarea parametrilor modelului Naive Bayes\", mai sus. În cod, `log_pos` și `log_neg` trebuie însumate cu logaritmul pentru fiecare exemplu -- $ \\log(P(c)) $ este deja adunat.\n",
    "\n",
    "De aceea vom folosi netezire Laplace (eng. _Laplace smoothing_):\n",
    "\n",
    "$$P(x_i \\vert c) = \\frac{\\#\\text{ aparitii ale lui } x_i \\text{ in documente din clasa } c + \\alpha}{\\#\\text{ numar total de cuvinte in documentele din clasa } c + \\vert Voc \\vert \\cdot \\alpha}$$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "in the opening shot of midnight cowboy , we see a close-up of a blank movie screen at a drive-in . \n",
      "we hear in the soundtrack human cries and the stomping of horses' hooves . \n",
      "without an image projected onto the screen , the audience unerringly identifies the familiar sound of cowboys chasing indians and can spontaneously fill in the blank screen with images of old westerns in our mind's eye . \n",
      "even without having seen a cowboys and indians movie , somehow the cliched images of them seem to have found their way into our mental schema . \n",
      "but do cowboys really exist , or are they merely hollywood images personified by john wayne and gary cooper ? \n",
      "exploring this theme , director john schlesinger uses the idea of the cowboy as a metaphor for the american dream , an equally cliched yet ambiguous concept . \n",
      "is the ease at which salvation and success can be attained in america a hallmark of its experience or an urban legend ? \n",
      "midnight cowboy suggests that the american dream , like image of the cowboy , is merely a myth . \n",
      "as joe buck migrates from place to place , he finds neither redemption nor reward in his attempt to create a life for himself , only further degeneration . \n",
      "during the opening credits , joe walks past an abandoned theater whose decrepit marquee reads `john wayne : the alamo . ' \n",
      "as joe is on the bus listening to a radio talk show , a lady on the air describes her ideal man as `gary cooper ? but he's dead . ' \n",
      "a troubled expression comes across joe's face , as he wonders where have all the cowboys gone . \n",
      "having adopted the image of a cowboy since youth , joe now finds himself deserted by the persona he tried to embody . \n",
      "joe's persistence in playing the act of the cowboy serves as an analogue to his american dream . \n",
      "he romanticizes about making it in the big city , but his dreams will desert him as he is forced to compromise his ideals for sustenance . \n",
      "by the end of midnight cowboy , joe buck loses everything and gains nothing . \n",
      "just as the audience can picture cowboys chasing indians on a blank screen , we can also conjure up scenes from pretty woman as paradigms of american redemption and success . \n",
      "but how realistic are these ideals ? \n",
      "joe had raped and been raped in texas . \n",
      "the scars of his troubled past prompt him to migrate to new york , but he does not know that his aspirations to be a cowboy hero will fail him there just as they had in texas . \n",
      "alongside the dream of success is the dream of salvation . \n",
      "the ability to pack up one's belongings and start anew seems to be an exclusive american convention . \n",
      "schlesinger provides us with strong hints as to joe's abusive and abused past with flashbacks of improper relationships with crazy anne and granny . \n",
      "we understand that joe adopts the fa ? ade of a cowboy , a symbol of virility and gallantry , as an attempt to neutralize his shame . \n",
      "he runs from his past only to be sexually defiled this time by his homosexual experiences in new york . \n",
      "in the scene at the diner which foreshadows joe's encounter with the gay student , joe buck spills ketchup on himself . \n",
      "standing up , we see the ketchup has made a red stain running from the crotch of his pants down his thigh . \n",
      "schlesinger visually depicts the degeneration of joe's virility by eliciting an image of bleeding genitals , signifying emasculation . \n",
      "beyond the symbol of castration , the scene may also connote the bleeding of a virgin's first sexual encounter , a reference to joe's first homosexual liaison . \n",
      "the fact that the idea of a bleeding virgin is relegated only to females furthers the imagery of joe's emasculation . \n",
      "it is ironic that joe has trouble prospecting for female clients , but effortlessly attracts men . \n",
      "joe believes his broncobuster getup is emblematic of his masculinity ; new yorkers see his ensemble as camp and `faggot stuff . ' \n",
      "there are two predominant images of new york . \n",
      "the first is that new york is the rich , cosmopolitan city where hope and opportunity are symbolized by the tall skyscrapers and the statue of liberty . \n",
      "the other new york is travis bickle's new york , a seedy , corruptive hell on earth . \n",
      "joe envisions new york as the former , but is presented with the latter . \n",
      "mirroring the irony in which joe envisions his cowboy attire as masculine , he mistakenly buys into the fable that new york is filled with lonely women neglected by gay men . \n",
      "joe thinks he is performing a great service for new york , but the city rapes him of his pride and possessions . \n",
      "the people steal joe's money , the landlord confiscates his luggage , and the homosexuals rob him of his dignity . \n",
      "what has become of joe's american dream ? \n",
      "schlesinger responds to this question with the scene at the party . \n",
      "joe gets invited to a shindig of sorts and at the gathering is exposed to a dizzying array of food , drugs , and sex . \n",
      "at the party , all of joe and ratzo's desires are made flesh ; joe flirts successfully with women and ratzo loads up on free salami . \n",
      "contrasting joe's daily struggles , shots of warhol's crew display wanton indulgence . \n",
      "there is an irreverence in the partygoers' attitude ; we see a shot of a woman kowtowing to nothing in particular , orgies breaking out in the periphery , and drugs passed around like party favors . \n",
      "the party makes a mockery of joe' s ideals . \n",
      "joe believed that hard work and persistence were the elements for success in america ; scenes of the party and his rendezvous with shirley suggest that it is the idle who profit from joe's toils . \n",
      "the american dream , schlesinger suggests , is merely a proletarian fantasy , for those who are content no longer dream , but become indolent . \n",
      "as joe heads to miami , all that was significant of the cowboy image has left him . \n",
      "his masculinity is compromised and his morality is relinquished . \n",
      "for joe , nothing is left of the cowboy hero and commensurately , he surrenders the identity . \n",
      "tossing his boots into the garbage , he returns to the bus for the last leg of his journey to miami . \n",
      "the final shot of midnight cowboy shows joe inside the bus , more introspective , taking only a few glances outside the window . \n",
      "instead of the frequent pov shots of joe excitedly looking out of the bus on his way to new york , schlesinger sets up this final shot from the exterior of the bus looking in through the window at joe . \n",
      "reflections of the palm trees ratzo so raved about run across the bus' window with joe hardly taking notice . \n",
      "the scenery of miami no longer exacts the same excitement from joe as before . \n",
      "the world seems smaller to joe now ; the termination of his journey coincides with the termination of his american dream . \n",
      "no longer does joe aspire to be the enterprising gigolo ; he resolves to return to a normal job and resign to basic means . \n",
      "midnight cowboy presents two familiar incarnations of the american dream . \n",
      "there is the frontier fantasy that if you are brave enough to repel a few indians , you can set up a ranch out west and raise a beautiful family . \n",
      "then there is the jay gatsby dream that a man of humble stock , with perseverance , can make a fortune in the big city . \n",
      "joe's attempt to realize these dreams robs him of his innocence in texas and morality in new york . \n",
      "during his search for an intangible paradise , joe ends up raping a girl and killing a man . \n",
      "an allegory of chasing the promise of the american dream , joe buck's progressive moral atrophy is a warning against the pursuit of illusory icons . \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "('pos', -9356.037538324854)"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from math import log\n",
    "\n",
    "def predict(params, path, alpha = 1):\n",
    "    (vocabulary, pos_words_no, neg_words_no) = params\n",
    "    log_pos = log(0.5)\n",
    "    log_neg = log(0.5)\n",
    "    \n",
    "    # ----------------------------------------------------------------------\n",
    "    # <TODO 2> Calculul logaritmilor probabilităților\n",
    "    \n",
    "    # ----------------------------------------------------------------------\n",
    "    vocsize = len(vocabulary)\n",
    "    for w in parse_document(path):\n",
    "        if w not in vocabulary:\n",
    "            log_pos += log(alpha / (pos_words_no + vocsize * alpha))\n",
    "            log_neg += log(alpha / (neg_words_no + vocsize * alpha))\n",
    "            continue\n",
    "        pos, neg = vocabulary[w]\n",
    "        log_pos += log((pos + alpha) / (pos_words_no + vocsize * alpha))\n",
    "        log_neg += log((neg + alpha) / (neg_words_no + vocsize * alpha))\n",
    "    \n",
    "    if log_pos > log_neg:\n",
    "        return \"pos\", log_pos\n",
    "    else:\n",
    "        return \"neg\", log_neg\n",
    "\n",
    "# -- VERIFICARE --\n",
    "print(zipi.read(pos_test[14]).decode(\"utf-8\"))\n",
    "predict(count_words(), pos_test[14])\n",
    "\n",
    "# Daca se comentează liniile care reordonează aleator listele cu exemplele pozitive și negative,\n",
    "# rezultatul așteptat este:\n",
    "#\n",
    "# ('pos', -1790.27088356391) pentru un film cu Hugh Grant și Julia Roberts (o mizerie siropoasă)\n",
    "#\n",
    "# Recenzia este clasificată corect ca fiind pozitivă."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Evaluarea modelului\n",
    "\n",
    "Pentru a evalua modelul vom calcula acuratețea acestuia și matricea de confuzie, folosind datele de test (`pos_test` și `neg_test`).\n",
    "\n",
    "[Vedeți aici despre matricea de confuzie](https://en.wikipedia.org/wiki/Confusion_matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Acuratetea pe setul de date de test:  82.75 %. Matricea de confuzie:\n",
      "    |     pos      |     neg    \n",
      "--- + ------------ + ------------\n",
      "pos |     164      |      36    \n",
      "neg |      33      |     167    \n"
     ]
    }
   ],
   "source": [
    "def evaluate(params, prediction_func):\n",
    "    conf_matrix = {}\n",
    "    conf_matrix[\"pos\"] = {\"pos\": 0, \"neg\": 0}\n",
    "    conf_matrix[\"neg\"] = {\"pos\": 0, \"neg\": 0}\n",
    "    \n",
    "    # ----------------------------------------------------------------------\n",
    "    # <TODO 3> : Calcularea acurateței și a matricei de confuzie\n",
    "    \n",
    "    #------------------------------------------------------------\n",
    "    for f in pos_test:\n",
    "        conf_matrix[\"pos\"][prediction_func(params, f)[0]] += 1\n",
    "    for f in neg_test:\n",
    "        conf_matrix[\"neg\"][prediction_func(params, f)[0]] += 1\n",
    "    \n",
    "    accuracy = (conf_matrix[\"pos\"][\"pos\"] + conf_matrix[\"neg\"][\"neg\"]) / sum([val for row in conf_matrix.values() for val in row.values()])\n",
    "    \n",
    "    return accuracy, conf_matrix\n",
    "# -----------------------------------------------------------\n",
    "\n",
    "def print_confusion_matrix(cm):\n",
    "    print(\"    | \", \"{0:^10}\".format(\"pos\"), \" | \", \"{0:^10}\".format(\"neg\"))\n",
    "    print(\"{0:-^3}\".format(\"\"), \"+\", \"{0:-^12}\".format(\"\"), \"+\", \"{0:-^12}\".format(\"-\", fill=\"-\"))\n",
    "    print(\"pos | \", \"{0:^10}\".format(cm[\"pos\"][\"pos\"]), \" | \", \"{0:^10}\".format(cm[\"pos\"][\"neg\"]))\n",
    "    print(\"neg | \", \"{0:^10}\".format(cm[\"neg\"][\"pos\"]), \" | \", \"{0:^10}\".format(cm[\"neg\"][\"neg\"]))\n",
    "\n",
    "\n",
    "# -- VERIFICARE --\n",
    "(acc_words, cm_words) = evaluate(count_words(), predict)\n",
    "print(\"Acuratetea pe setul de date de test: \", acc_words * 100, \"%. Matricea de confuzie:\")\n",
    "print_confusion_matrix(cm_words)\n",
    "\n",
    "# Daca se comentează liniile care reordonează aleator listele cu exemplele pozitive și negative,\n",
    "# rezultatul așteptat este:\n",
    "#\n",
    "# Acuratetea pe setul de date de test:  80.5 %. Matricea de confuzie:\n",
    "#     |     pos      |     neg    \n",
    "# --- + ------------ + ------------\n",
    "# pos |     155      |      45    \n",
    "# neg |      33      |     167"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Un model mai bun? Să folosim bigrame? Da!\n",
    "\n",
    "Implementați funcția `count_bigrams`, similară cu `count_words`, doar că de data aceasta dicționarul va conține bigramele din text. Funcția va întoarce tot trei elemente: dicționarul cu aparițiile în recenzii pozitive și în recenzii negative ale bigramelor, numărul total de bigrame din recenziile pozitive și numărul total de bigrame din recenziile negative.\n",
    "\n",
    "Salvați o bigramă prin concatenarea primului cuvânt, semnului \":\" și a celui de-al doilea cuvânt. De exemplu: `\"texas:ranger\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tabelul are  427828  bigrame.\n",
      "528651  bigrame in recenziile pozitive si  464426  bigrame in recenziile negative\n",
      "Bigrama 'beautiful actress' are  (1, 0)  aparitii.\n",
      "Bigrama 'awful movie' are  (1, 3)  aparitii.\n"
     ]
    }
   ],
   "source": [
    "def bgr(f):\n",
    "    words = list(parse_document(f))\n",
    "    return [u + ':' + w for (u, w) in zip(words[:-1], words[1:])]\n",
    "\n",
    "def count_bigrams():\n",
    "    bigrams = {}\n",
    "    pos_bigrams_no = 0\n",
    "    neg_bigrams_no = 0\n",
    "\n",
    "    # ----------------------------------------------------------------------\n",
    "    # <TODO 4>: Numarati bigramele\n",
    "    \n",
    "    #-----------------------------------------------\n",
    "    pos_bigrams = [b for f in pos_train for b in bgr(f)]\n",
    "    for b in pos_bigrams:\n",
    "        pos_bigrams_no += 1\n",
    "        p, n = bigrams.get(b, (0, 0))\n",
    "        bigrams[b] = (p + 1, n)\n",
    "    neg_bigrams = [b for f in neg_train for b in bgr(f)]\n",
    "    for b in neg_bigrams:\n",
    "        neg_bigrams_no += 1\n",
    "        p, n = bigrams.get(b, (0, 0))\n",
    "        bigrams[b] = (p, n + 1)\n",
    "    \n",
    "    return bigrams, pos_bigrams_no, neg_bigrams_no\n",
    "\n",
    "# -- VERIFICARE --\n",
    "(big, pos_b, neg_b) = count_bigrams()\n",
    "print(\"Tabelul are \", len(big), \" bigrame.\")\n",
    "print(pos_b, \" bigrame in recenziile pozitive si \", neg_b, \" bigrame in recenziile negative\")\n",
    "print(\"Bigrama 'beautiful actress' are \", big.get(\"beautiful:actress\", (0, 0)), \" aparitii.\")\n",
    "print(\"Bigrama 'awful movie' are \", big.get(\"awful:movie\", (0, 0)), \" aparitii.\")\n",
    "\n",
    "# Daca se comentează liniile care reordonează aleator listele cu exemplele pozitive și negative,\n",
    "# rezultatul așteptat este:\n",
    "#\n",
    "# Tabelul are  428997  bigrame.\n",
    "# 525467  bigrame in recenziile pozitive si  469012  bigrame in recenziile negative\n",
    "# Bigrama 'beautiful actress' are  (2, 0)  aparitii.\n",
    "# Bigrama 'awful movie' are  (1, 4)  aparitii."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Funcția de predicție folosind bigrame\n",
    "\n",
    "Implementați funcția `predict2` care să calculeze logaritmul probabilității fiecărei clase pe baza bigramelor din text. Trebuie să calculați `log_pos` și `log_neg`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "in the opening shot of midnight cowboy , we see a close-up of a blank movie screen at a drive-in . \n",
      "we hear in the soundtrack human cries and the stomping of horses' hooves . \n",
      "without an image projected onto the screen , the audience unerringly identifies the familiar sound of cowboys chasing indians and can spontaneously fill in the blank screen with images of old westerns in our mind's eye . \n",
      "even without having seen a cowboys and indians movie , somehow the cliched images of them seem to have found their way into our mental schema . \n",
      "but do cowboys really exist , or are they merely hollywood images personified by john wayne and gary cooper ? \n",
      "exploring this theme , director john schlesinger uses the idea of the cowboy as a metaphor for the american dream , an equally cliched yet ambiguous concept . \n",
      "is the ease at which salvation and success can be attained in america a hallmark of its experience or an urban legend ? \n",
      "midnight cowboy suggests that the american dream , like image of the cowboy , is merely a myth . \n",
      "as joe buck migrates from place to place , he finds neither redemption nor reward in his attempt to create a life for himself , only further degeneration . \n",
      "during the opening credits , joe walks past an abandoned theater whose decrepit marquee reads `john wayne : the alamo . ' \n",
      "as joe is on the bus listening to a radio talk show , a lady on the air describes her ideal man as `gary cooper ? but he's dead . ' \n",
      "a troubled expression comes across joe's face , as he wonders where have all the cowboys gone . \n",
      "having adopted the image of a cowboy since youth , joe now finds himself deserted by the persona he tried to embody . \n",
      "joe's persistence in playing the act of the cowboy serves as an analogue to his american dream . \n",
      "he romanticizes about making it in the big city , but his dreams will desert him as he is forced to compromise his ideals for sustenance . \n",
      "by the end of midnight cowboy , joe buck loses everything and gains nothing . \n",
      "just as the audience can picture cowboys chasing indians on a blank screen , we can also conjure up scenes from pretty woman as paradigms of american redemption and success . \n",
      "but how realistic are these ideals ? \n",
      "joe had raped and been raped in texas . \n",
      "the scars of his troubled past prompt him to migrate to new york , but he does not know that his aspirations to be a cowboy hero will fail him there just as they had in texas . \n",
      "alongside the dream of success is the dream of salvation . \n",
      "the ability to pack up one's belongings and start anew seems to be an exclusive american convention . \n",
      "schlesinger provides us with strong hints as to joe's abusive and abused past with flashbacks of improper relationships with crazy anne and granny . \n",
      "we understand that joe adopts the fa ? ade of a cowboy , a symbol of virility and gallantry , as an attempt to neutralize his shame . \n",
      "he runs from his past only to be sexually defiled this time by his homosexual experiences in new york . \n",
      "in the scene at the diner which foreshadows joe's encounter with the gay student , joe buck spills ketchup on himself . \n",
      "standing up , we see the ketchup has made a red stain running from the crotch of his pants down his thigh . \n",
      "schlesinger visually depicts the degeneration of joe's virility by eliciting an image of bleeding genitals , signifying emasculation . \n",
      "beyond the symbol of castration , the scene may also connote the bleeding of a virgin's first sexual encounter , a reference to joe's first homosexual liaison . \n",
      "the fact that the idea of a bleeding virgin is relegated only to females furthers the imagery of joe's emasculation . \n",
      "it is ironic that joe has trouble prospecting for female clients , but effortlessly attracts men . \n",
      "joe believes his broncobuster getup is emblematic of his masculinity ; new yorkers see his ensemble as camp and `faggot stuff . ' \n",
      "there are two predominant images of new york . \n",
      "the first is that new york is the rich , cosmopolitan city where hope and opportunity are symbolized by the tall skyscrapers and the statue of liberty . \n",
      "the other new york is travis bickle's new york , a seedy , corruptive hell on earth . \n",
      "joe envisions new york as the former , but is presented with the latter . \n",
      "mirroring the irony in which joe envisions his cowboy attire as masculine , he mistakenly buys into the fable that new york is filled with lonely women neglected by gay men . \n",
      "joe thinks he is performing a great service for new york , but the city rapes him of his pride and possessions . \n",
      "the people steal joe's money , the landlord confiscates his luggage , and the homosexuals rob him of his dignity . \n",
      "what has become of joe's american dream ? \n",
      "schlesinger responds to this question with the scene at the party . \n",
      "joe gets invited to a shindig of sorts and at the gathering is exposed to a dizzying array of food , drugs , and sex . \n",
      "at the party , all of joe and ratzo's desires are made flesh ; joe flirts successfully with women and ratzo loads up on free salami . \n",
      "contrasting joe's daily struggles , shots of warhol's crew display wanton indulgence . \n",
      "there is an irreverence in the partygoers' attitude ; we see a shot of a woman kowtowing to nothing in particular , orgies breaking out in the periphery , and drugs passed around like party favors . \n",
      "the party makes a mockery of joe' s ideals . \n",
      "joe believed that hard work and persistence were the elements for success in america ; scenes of the party and his rendezvous with shirley suggest that it is the idle who profit from joe's toils . \n",
      "the american dream , schlesinger suggests , is merely a proletarian fantasy , for those who are content no longer dream , but become indolent . \n",
      "as joe heads to miami , all that was significant of the cowboy image has left him . \n",
      "his masculinity is compromised and his morality is relinquished . \n",
      "for joe , nothing is left of the cowboy hero and commensurately , he surrenders the identity . \n",
      "tossing his boots into the garbage , he returns to the bus for the last leg of his journey to miami . \n",
      "the final shot of midnight cowboy shows joe inside the bus , more introspective , taking only a few glances outside the window . \n",
      "instead of the frequent pov shots of joe excitedly looking out of the bus on his way to new york , schlesinger sets up this final shot from the exterior of the bus looking in through the window at joe . \n",
      "reflections of the palm trees ratzo so raved about run across the bus' window with joe hardly taking notice . \n",
      "the scenery of miami no longer exacts the same excitement from joe as before . \n",
      "the world seems smaller to joe now ; the termination of his journey coincides with the termination of his american dream . \n",
      "no longer does joe aspire to be the enterprising gigolo ; he resolves to return to a normal job and resign to basic means . \n",
      "midnight cowboy presents two familiar incarnations of the american dream . \n",
      "there is the frontier fantasy that if you are brave enough to repel a few indians , you can set up a ranch out west and raise a beautiful family . \n",
      "then there is the jay gatsby dream that a man of humble stock , with perseverance , can make a fortune in the big city . \n",
      "joe's attempt to realize these dreams robs him of his innocence in texas and morality in new york . \n",
      "during his search for an intangible paradise , joe ends up raping a girl and killing a man . \n",
      "an allegory of chasing the promise of the american dream , joe buck's progressive moral atrophy is a warning against the pursuit of illusory icons . \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "('pos', -15094.568084601153)"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def predict2(params, path, alpha = 1):\n",
    "    (bigrams, pos_bigrams_no, neg_bigrams_no) = params\n",
    "    log_pos = log(0.5)\n",
    "    log_neg = log(0.5)\n",
    "    \n",
    "    # ----------------------------------------------------------------------\n",
    "    # <TODO 5> Calculul logaritmilor probabilităților folosind bigramele\n",
    "    \n",
    "    # ----------------------------------------------------------------------\n",
    "    vocsize = len(bigrams)\n",
    "    for b in bgr(path):\n",
    "        if b not in bigrams:\n",
    "            log_pos += log(alpha / (pos_bigrams_no + vocsize * alpha))\n",
    "            log_neg += log(alpha / (neg_bigrams_no + vocsize * alpha))\n",
    "            continue\n",
    "        pos, neg = bigrams[b]\n",
    "        log_pos += log((pos + alpha) / (pos_bigrams_no + vocsize * alpha))\n",
    "        log_neg += log((neg + alpha) / (neg_bigrams_no + vocsize * alpha))\n",
    "    \n",
    "    if log_pos > log_neg:\n",
    "        return \"pos\", log_pos\n",
    "    else:\n",
    "        return \"neg\", log_neg\n",
    "    \n",
    "# -- VERIFICARE --\n",
    "print(zipi.read(pos_test[14]).decode(\"utf-8\"))\n",
    "predict2(count_bigrams(), pos_test[14])\n",
    "\n",
    "# Daca se comentează liniile care reordonează aleator listele cu exemplele pozitive și negative,\n",
    "# rezultatul așteptat este:\n",
    "#\n",
    "# ('pos', -3034.428732037113) pentru același film cu Hugh Grant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Acuratetea pe setul de date de test, cu bigrame:  85.75 %. Matricea de confuzie:\n",
      "    |     pos      |     neg    \n",
      "--- + ------------ + ------------\n",
      "pos |     165      |      35    \n",
      "neg |      22      |     178    \n"
     ]
    }
   ],
   "source": [
    "# -- VERIFICARE --\n",
    "(acc_bigrams, cm_bigrams) = evaluate(count_bigrams(), predict2)\n",
    "print(\"Acuratetea pe setul de date de test, cu bigrame: \", acc_bigrams * 100, \"%. Matricea de confuzie:\")\n",
    "print_confusion_matrix(cm_bigrams)\n",
    "\n",
    "# Daca se comentează liniile care reordonează aleator listele cu exemplele pozitive și negative,\n",
    "# rezultatul așteptat este:\n",
    "#\n",
    "# Acuratetea pe setul de date de test:  84.5 %. Matricea de confuzie:\n",
    "#     |     pos      |     neg    \n",
    "# --- + ------------ + ------------\n",
    "# pos |     161      |      39    \n",
    "# neg |      23      |     177   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## La final...\n",
    "\n",
    " 1. Decomentați liniile care reordonează aleator listele cu exemplele pozitive și cele negative (secțiunea \"Setul de antrenare și setul de testare\"). Rulați de mai multe ori. Este întotdeauna mai bun modelul cu bigrame? Acuratețea variază mult de la o rulare la alta?\n",
    " 2. Încercați să eliminați cuvintele de legătură (linia cu `STOP_WORDS`, din secțiunea \"Construirea vocabularului...\"). Ce impact are asupra performanței celor două modele?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Acuratetea pe setul de date de test, cu cuvinte simple:  82.75 %. Matricea de confuzie:\n",
      "    |     pos      |     neg    \n",
      "--- + ------------ + ------------\n",
      "pos |     164      |      36    \n",
      "neg |      33      |     167    \n",
      "\n",
      "\n",
      "Acuratetea pe setul de date de test, cu bigrame:  85.75 %. Matricea de confuzie:\n",
      "    |     pos      |     neg    \n",
      "--- + ------------ + ------------\n",
      "pos |     165      |      35    \n",
      "neg |      22      |     178    \n"
     ]
    }
   ],
   "source": [
    "print(\"Acuratetea pe setul de date de test, cu cuvinte simple: \", acc_words * 100, \"%. Matricea de confuzie:\")\n",
    "print_confusion_matrix(cm_words)\n",
    "\n",
    "print(\"\\n\\nAcuratetea pe setul de date de test, cu bigrame: \", acc_bigrams * 100, \"%. Matricea de confuzie:\")\n",
    "print_confusion_matrix(cm_bigrams)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }