@pmbaumgartner
Last active February 7, 2019 00:29
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# ⚡️Applied Natural Language Processing in Python ⚡️\n",
"## Peter Baumgartner\n",
"### Data Scientist @ RTI International\n",
"#### Notebook @ [http://bit.ly/omg-nlp](http://bit.ly/omg-nlp)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Requirements\n",
"\n",
"*(in a virtualenv/conda env)*\n",
"\n",
"```bash\n",
"$ pip install spacy gensim pandas tabulate\n",
"$ python -m spacy download en_core_web_lg\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Data\n",
"\n",
"## [HappyDB](https://rit-public.github.io/HappyDB/)\n",
"\n",
"<img src=\"http://funkyimg.com/i/2JCu6.png\" border=\"0\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data Load"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"96486"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from tabulate import tabulate\n",
"\n",
"data_url = 'https://github.com/rit-public/HappyDB/' \\\n",
"'raw/master/happydb/data/cleaned_hm.csv'\n",
"\n",
"happy = (pd.read_csv(data_url, usecols=['cleaned_hm'])\n",
"         .drop_duplicates('cleaned_hm'))\n",
"\n",
"len(happy)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Parsing Texts"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"nlp = spacy.load('en_core_web_lg')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# https://github.com/explosion/spaCy/issues/1574\n",
"for word in nlp.Defaults.stop_words:\n",
"    for w in (word, word[0].upper() + word[1:], word.upper()):\n",
"        lex = nlp.vocab[w]\n",
"        lex.is_stop = True"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"texts = (text for (index, text) in happy['cleaned_hm'].iteritems())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8min 40s, sys: 1min 14s, total: 9min 54s\n",
"Wall time: 5min 57s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"docs = []\n",
"for doc in nlp.pipe(texts, n_threads=-1):\n",
"    docs.append(doc)"
]
},
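{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"*Note:* `n_threads` only has an effect in older spaCy releases; from spaCy 2.1 it is ignored, and later 2.x releases add an `n_process` argument for multiprocessing instead. A sketch for those newer versions (the argument values here are illustrative, not tuned):\n",
"\n",
"```python\n",
"docs = list(nlp.pipe(texts, n_process=2, batch_size=1000))\n",
"```"
]
},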
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I went on a successful date with someone I felt sympathy and connection with.\n",
"I was happy when my son got 90% marks in his examination \n",
"I went to the gym this morning and did yoga.\n",
"We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.\n",
"I went with grandchildren to butterfly display at Crohn Conservatory\r\n",
"\n"
]
}
],
"source": [
"print(*docs[:5], sep=\"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I made vacation plans with my daughter today for Florida in July. \n",
"\n"
]
}
],
"source": [
"sample_doc = docs[403]\n",
"\n",
"print(sample_doc, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Tokens & Lemmas\n",
"Iterating over a parsed document will give you tokens. Tokens have attributes that are calculated during the parsing step."
]
},
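{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Besides `text` and `lemma_`, each token exposes other precomputed attributes, e.g. `pos_`, `is_stop`, and `is_punct`. A quick sketch using the `sample_doc` parsed above:\n",
"\n",
"```python\n",
"# inspect a few token-level attributes on the first four tokens\n",
"for token in sample_doc[:4]:\n",
"    print(token.text, token.pos_, token.is_stop, token.is_punct)\n",
"```"
]
},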
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tokens lemmas\n",
"-------- --------\n",
"I -PRON-\n",
"made make\n",
"vacation vacation\n",
"plans plan\n",
"with with\n",
"my -PRON-\n",
"daughter daughter\n",
"today today\n",
"for for\n",
"Florida florida\n",
"in in\n",
"July july\n",
". .\n"
]
}
],
"source": [
"tokens_and_lemmas = [(token.text, token.lemma_) for token in sample_doc]\n",
"\n",
"print(tabulate(tokens_and_lemmas, headers=['tokens', 'lemmas']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Named Entities"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('today', 'DATE'), ('Florida', 'GPE'), ('July', 'DATE')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_doc_entities = sample_doc.ents\n",
"\n",
"[(ent.text, ent.label_) for ent in sample_doc_entities]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Countries, cities, states'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spacy.explain('GPE')"
]
},
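{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Entities can also be tallied over the whole corpus rather than one document. A sketch, assuming the `docs` list built during parsing:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# count entity labels across every parsed document\n",
"entity_labels = Counter(ent.label_ for doc in docs for ent in doc.ents)\n",
"entity_labels.most_common(5)\n",
"```"
]
},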
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"<table class=\"c-table o-block o-block-small\"><tbody><tr class=\"c-table__row c-table__row--head\"><th colspan=\"2\" class=\"c-table__head-cell u-text-label\">NER accuracy</th></tr><tr class=\"c-table__row\"><td class=\"c-table__cell u-text u-nowrap\"><div class=\"u-text-label u-color-dark\">NER F <span data-tooltip=\"Entities (F-score)\" class=\"u-color-subtle\"><span aria-role=\"tooltip\" class=\"u-hidden\">Entities (F-score)</span><svg aria-hidden=\"true\" viewBox=\"0 0 16 16\" width=\"16\" height=\"16\" class=\"o-icon o-icon--inline\" style=\"min-width: 16px;\"><use xlink:href=\"#svg_help_o\"></use></svg></span></div></td><td class=\"c-table__cell u-text c-table__cell--num\"><span>85.85</span><!----></td></tr><tr class=\"c-table__row\"><td class=\"c-table__cell u-text u-nowrap\"><div class=\"u-text-label u-color-dark\">NER P <span data-tooltip=\"Entities (precision)\" class=\"u-color-subtle\"><span aria-role=\"tooltip\" class=\"u-hidden\">Entities (precision)</span><svg aria-hidden=\"true\" viewBox=\"0 0 16 16\" width=\"16\" height=\"16\" class=\"o-icon o-icon--inline\" style=\"min-width: 16px;\"><use xlink:href=\"#svg_help_o\"></use></svg></span></div></td><td class=\"c-table__cell u-text c-table__cell--num\"><span>85.54</span><!----></td></tr><tr class=\"c-table__row\"><td class=\"c-table__cell u-text u-nowrap\"><div class=\"u-text-label u-color-dark\">NER R <span data-tooltip=\"Entities (recall)\" class=\"u-color-subtle\"><span aria-role=\"tooltip\" class=\"u-hidden\">Entities (recall)</span><svg aria-hidden=\"true\" viewBox=\"0 0 16 16\" width=\"16\" height=\"16\" class=\"o-icon o-icon--inline\" style=\"min-width: 16px;\"><use xlink:href=\"#svg_help_o\"></use></svg></span></div></td><td class=\"c-table__cell u-text c-table__cell--num\"><span>86.16</span><!----></td></tr></tbody></table>\n",
"\n",
"from: https://spacy.io/models/en#en_core_web_lg"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's get analyzing!\n",
"\n",
"Pattern:\n",
"- Collect all the tokens and attributes we want in a `list`\n",
"- Throw them in a `Counter`\n",
"- Print out the most common values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Most Common Words"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"lemma count\n",
"------- -------\n",
"-PRON- 260425\n",
". 109403\n",
"a 69293\n",
"be 65667\n",
"to 54645\n",
"and 54633\n",
"the 50068\n",
", 30174\n",
"for 26384\n",
"have 25624\n",
"in 25131\n",
"of 24057\n",
"that 21908\n",
"with 21442\n",
"get 19325\n"
]
}
],
"source": [
"from collections import Counter\n",
"\n",
"all_lemmas = [token.lemma_ for doc in docs for token in doc]\n",
"\n",
"most_common_lemmas = Counter(all_lemmas).most_common(15)\n",
"\n",
"print(tabulate(most_common_lemmas, headers=['lemma', 'count']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Same as:\n",
"```python\n",
"all_lemmas = []\n",
"for doc in docs:\n",
"    for token in doc:\n",
"        all_lemmas.append(token.lemma_)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Removing Stopwords & Punctuation"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"lemma count\n",
"------- -------\n",
"happy 19156\n",
"get 15485\n",
"go 11832\n",
"friend 10352\n",
"work 9619\n",
"day 9446\n",
"time 9310\n",
"new 8492\n",
"good 7985\n",
"feel 6008\n",
"month 5148\n",
"able 5098\n",
"today 4983\n",
"find 4749\n",
"come 4628\n"
]
}
],
"source": [
"def token_filter(token):\n",
"    return not any((token.is_punct, token.is_stop, token.is_space))\n",
"\n",
"\n",
"all_clean_lemmas = [\n",
"    token.lemma_ for doc in docs for token in doc if token_filter(token)]\n",
"\n",
"most_common_good_lemmas = Counter(all_clean_lemmas).most_common(15)\n",
"\n",
"print(tabulate(most_common_good_lemmas, headers=['lemma', 'count']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Part-of-Speech (POS) Tags"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"all_tokens = [token for doc in docs for token in doc]\n",
"\n",
"nouns = [token.lemma_ for token in all_tokens if token.pos_ == 'NOUN']\n",
"verbs = [token.lemma_ for token in all_tokens if token.pos_ == 'VERB']\n",
"adjectives = [token.lemma_ for token in all_tokens if token.pos_ == 'ADJ']\n",
"\n",
"noun_count = Counter(nouns).most_common(15)\n",
"verb_count = Counter(verbs).most_common(15)\n",
"adjective_count = Counter(adjectives).most_common(15)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"nouns noun_count verbs verb_count adj adj_count\n",
"--------- ------------ ------- ------------ -------- -----------\n",
"friend 10179 be 65662 -PRON- 83505\n",
"time 9285 have 25624 happy 19058\n",
"day 8835 get 19263 that 9131\n",
"work 6457 make 14923 new 8223\n",
"month 5026 go 14509 good 7905\n",
"today 4949 do 7698 last 5976\n",
"family 4392 see 7449 able 5059\n",
"week 4242 feel 5966 first 4089\n",
"year 4145 find 4741 which 3585\n",
"son 3512 come 4622 great 3343\n",
"yesterday 3500 take 4514 old 3235\n",
"night 3434 watch 4183 nice 2975\n",
"daughter 3345 buy 3865 long 2970\n",
"dinner 3343 play 3565 favorite 2817\n",
"job 3135 give 3174 few 2220\n"
]
}
],
"source": [
"# zip(*pairs) transposes each (lemma, count) list into columns;\n",
"# the outer zip stitches those columns back together as rows 🎩\n",
"rows = zip(*zip(*noun_count),\n",
"           *zip(*verb_count),\n",
"           *zip(*adjective_count))\n",
"\n",
"columns = ['nouns', 'noun_count', 'verbs', 'verb_count', 'adj', 'adj_count']\n",
"\n",
"print(tabulate(rows, columns))"
]
},
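{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"What the `zip` trick above does: `zip(*pairs)` transposes a list of `(lemma, count)` pairs into two columns, and the outer `zip` stitches the six columns back together as table rows. The same idea on toy data:\n",
"\n",
"```python\n",
"a = [('x', 1), ('y', 2)]\n",
"b = [('p', 9), ('q', 8)]\n",
"\n",
"# zip(*a) -> ('x', 'y'), (1, 2); zip(*b) -> ('p', 'q'), (9, 8)\n",
"rows = list(zip(*zip(*a), *zip(*b)))\n",
"# rows == [('x', 1, 'p', 9), ('y', 2, 'q', 8)]\n",
"```"
]
},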
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# 🛑 End Part 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`<Intentionally Blank>`"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}