@pmbaumgartner
Last active February 7, 2019 00:29
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# ⚡️Applied Natural Language Processing in Python ⚡️\n",
"## Peter Baumgartner\n",
"### Data Scientist @ RTI International\n",
"#### Notebook @ [http://bit.ly/omg-nlp](http://bit.ly/omg-nlp)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Requirements\n",
"\n",
"*(in a virtualenv/conda env)*\n",
"\n",
"```bash\n",
"$ pip install spacy gensim pandas tabulate\n",
"$ python -m spacy download en_core_web_lg\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Data\n",
"\n",
"## [HappyDB](https://rit-public.github.io/HappyDB/)\n",
"\n",
"<img src=\"http://funkyimg.com/i/2JCu6.png\" border=\"0\" width=\"600\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data Load"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"96486"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from tabulate import tabulate\n",
"\n",
"data_url = 'https://github.com/rit-public/HappyDB/' \\\n",
"'raw/master/happydb/data/cleaned_hm.csv'\n",
"\n",
"happy = (pd.read_csv(data_url, usecols=['cleaned_hm'])\n",
"         .drop_duplicates('cleaned_hm'))\n",
"\n",
"len(happy)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Parsing Texts"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"nlp = spacy.load('en_core_web_lg')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# https://github.com/explosion/spaCy/issues/1574\n",
"for word in nlp.Defaults.stop_words:\n",
"    for w in (word, word[0].upper() + word[1:], word.upper()):\n",
"        lex = nlp.vocab[w]\n",
"        lex.is_stop = True"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"texts = (text for (index, text) in happy['cleaned_hm'].iteritems())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8min 40s, sys: 1min 14s, total: 9min 54s\n",
"Wall time: 5min 57s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"docs = []\n",
"for doc in nlp.pipe(texts, n_threads=-1):\n",
"    docs.append(doc)"
]
},
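{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"*Note:* `n_threads` only has an effect in older spaCy releases; from spaCy 2.1 it is ignored, and later 2.x releases add an `n_process` argument for multiprocessing instead. A sketch for those newer versions (the argument values here are illustrative, not tuned):\n",
"\n",
"```python\n",
"docs = list(nlp.pipe(texts, n_process=2, batch_size=1000))\n",
"```"
]
},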
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I went on a successful date with someone I felt sympathy and connection with.\n",
"I was happy when my son got 90% marks in his examination \n",
"I went to the gym this morning and did yoga.\n",
"We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.\n",
"I went with grandchildren to butterfly display at Crohn Conservatory\r\n",
"\n"
]
}
],
"source": [
"print(*docs[:5], sep=\"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I made vacation plans with my daughter today for Florida in July. \n",
"\n"
]
}
],
"source": [
"sample_doc = docs[403]\n",
"\n",
"print(sample_doc, \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Tokens & Lemmas\n",
"Iterating over a parsed document will give you tokens. Tokens have attributes that are calculated during the parsing step."
]
},
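{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Besides `text` and `lemma_`, each token exposes other precomputed attributes, e.g. `pos_`, `is_stop`, and `is_punct`. A quick sketch using the `sample_doc` parsed above:\n",
"\n",
"```python\n",
"# inspect a few token-level attributes on the first four tokens\n",
"for token in sample_doc[:4]:\n",
"    print(token.text, token.pos_, token.is_stop, token.is_punct)\n",
"```"
]
},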
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tokens lemmas\n",
"-------- --------\n",
"I -PRON-\n",
"made make\n",
"vacation vacation\n",
"plans plan\n",
"with with\n",
"my -PRON-\n",
"daughter daughter\n",
"today today\n",
"for for\n",
"Florida florida\n",
"in in\n",
"July july\n",
". .\n"
]
}
],
"source": [
"tokens_and_lemmas = [(token.text, token.lemma_) for token in sample_doc]\n",
"\n",
"print(tabulate(tokens_and_lemmas, headers=['tokens', 'lemmas']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Named Entities"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('today', 'DATE'), ('Florida', 'GPE'), ('July', 'DATE')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_doc_entities = sample_doc.ents\n",
"\n",
"[(ent.text, ent.label_) for ent in sample_doc_entities]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Countries, cities, states'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spacy.explain('GPE')"
]
},
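{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Entities can also be tallied over the whole corpus rather than one document. A sketch, assuming the `docs` list built during parsing:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# count entity labels across every parsed document\n",
"entity_labels = Counter(ent.label_ for doc in docs for ent in doc.ents)\n",
"entity_labels.most_common(5)\n",
"```"
]
},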
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"<table class=\"c-table o-block o-block-small\"><tbody><tr class=\"c-table__row c-table__row--head\"><th colspan=\"2\" class=\"c-table__head-cell u-text-label\">NER accuracy</th></tr><tr class=\"c-table__row\"><td class=\"c-table__cell u-text u-nowrap\"><div class=\"u-text-label u-color-dark\">NER F <span data-tooltip=\"Entities (F-score)\" class=\"u-color-subtle\"><span aria-role=\"tooltip\" class=\"u-hidden\">Entities (F-score)</span><svg aria-hidden=\"true\" viewBox=\"0 0 16 16\" width=\"16\" height=\"16\" class=\"o-icon o-icon--inline\" style=\"min-width: 16px;\"><use xlink:href=\"#svg_help_o\"></use></svg></span></div></td><td class=\"c-table__cell u-text c-table__cell--num\"><span>85.85</span><!----></td></tr><tr class=\"c-table__row\"><td class=\"c-table__cell u-text u-nowrap\"><div class=\"u-text-label u-color-dark\">NER P <span data-tooltip=\"Entities (precision)\" class=\"u-color-subtle\"><span aria-role=\"tooltip\" class=\"u-hidden\">Entities (precision)</span><svg aria-hidden=\"true\" viewBox=\"0 0 16 16\" width=\"16\" height=\"16\" class=\"o-icon o-icon--inline\" style=\"min-width: 16px;\"><use xlink:href=\"#svg_help_o\"></use></svg></span></div></td><td class=\"c-table__cell u-text c-table__cell--num\"><span>85.54</span><!----></td></tr><tr class=\"c-table__row\"><td class=\"c-table__cell u-text u-nowrap\"><div class=\"u-text-label u-color-dark\">NER R <span data-tooltip=\"Entities (recall)\" class=\"u-color-subtle\"><span aria-role=\"tooltip\" class=\"u-hidden\">Entities (recall)</span><svg aria-hidden=\"true\" viewBox=\"0 0 16 16\" width=\"16\" height=\"16\" class=\"o-icon o-icon--inline\" style=\"min-width: 16px;\"><use xlink:href=\"#svg_help_o\"></use></svg></span></div></td><td class=\"c-table__cell u-text c-table__cell--num\"><span>86.16</span><!----></td></tr></tbody></table>\n",
"\n",
"from: https://spacy.io/models/en#en_core_web_lg"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Let's get analyzing!\n",
"\n",
"Pattern:\n",
"- Collect all the tokens and attributes we want in a `list`\n",
"- Throw them in a `Counter`\n",
"- Print out the most common values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Most Common Words"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"lemma count\n",
"------- -------\n",
"-PRON- 260425\n",
". 109403\n",
"a 69293\n",
"be 65667\n",
"to 54645\n",
"and 54633\n",
"the 50068\n",
", 30174\n",
"for 26384\n",
"have 25624\n",
"in 25131\n",
"of 24057\n",
"that 21908\n",
"with 21442\n",
"get 19325\n"
]
}
],
"source": [
"from collections import Counter\n",
"\n",
"all_lemmas = [token.lemma_ for doc in docs for token in doc]\n",
"\n",
"most_common_lemmas = Counter(all_lemmas).most_common(15)\n",
"\n",
"print(tabulate(most_common_lemmas, headers=['lemma', 'count']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Same as:\n",
"```python\n",
"all_lemmas = []\n",
"for doc in docs:\n",
"    for token in doc:\n",
"        all_lemmas.append(token.lemma_)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Removing Stopwords & Punctuation"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"lemma count\n",
"------- -------\n",
"happy 19156\n",
"get 15485\n",
"go 11832\n",
"friend 10352\n",
"work 9619\n",
"day 9446\n",
"time 9310\n",
"new 8492\n",
"good 7985\n",
"feel 6008\n",
"month 5148\n",
"able 5098\n",
"today 4983\n",
"find 4749\n",
"come 4628\n"
]
}
],
"source": [
"def token_filter(token):\n",
"    return not any((token.is_punct, token.is_stop, token.is_space))\n",
"\n",
"\n",
"all_clean_lemmas = [\n",
"    token.lemma_ for doc in docs for token in doc if token_filter(token)]\n",
"\n",
"most_common_good_lemmas = Counter(all_clean_lemmas).most_common(15)\n",
"\n",
"print(tabulate(most_common_good_lemmas, headers=['lemma', 'count']))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Part-of-Speech (POS) Tags"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"all_tokens = [token for doc in docs for token in doc]\n",
"\n",
"nouns = [token.lemma_ for token in all_tokens if token.pos_ == 'NOUN']\n",
"verbs = [token.lemma_ for token in all_tokens if token.pos_ == 'VERB']\n",
"adjectives = [token.lemma_ for token in all_tokens if token.pos_ == 'ADJ']\n",
"\n",
"noun_count = Counter(nouns).most_common(15)\n",
"verb_count = Counter(verbs).most_common(15)\n",
"adjective_count = Counter(adjectives).most_common(15)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"nouns noun_count verbs verb_count adj adj_count\n",
"--------- ------------ ------- ------------ -------- -----------\n",
"friend 10179 be 65662 -PRON- 83505\n",
"time 9285 have 25624 happy 19058\n",
"day 8835 get 19263 that 9131\n",
"work 6457 make 14923 new 8223\n",
"month 5026 go 14509 good 7905\n",
"today 4949 do 7698 last 5976\n",
"family 4392 see 7449 able 5059\n",
"week 4242 feel 5966 first 4089\n",
"year 4145 find 4741 which 3585\n",
"son 3512 come 4622 great 3343\n",
"yesterday 3500 take 4514 old 3235\n",
"night 3434 watch 4183 nice 2975\n",
"daughter 3345 buy 3865 long 2970\n",
"dinner 3343 play 3565 favorite 2817\n",
"job 3135 give 3174 few 2220\n"
]
}
],
"source": [
"# zip(*pairs) transposes each (lemma, count) list into columns;\n",
"# the outer zip stitches those columns back together as rows 🎩\n",
"rows = zip(*zip(*noun_count),\n",
"           *zip(*verb_count),\n",
"           *zip(*adjective_count))\n",
"\n",
"columns = ['nouns', 'noun_count', 'verbs', 'verb_count', 'adj', 'adj_count']\n",
"\n",
"print(tabulate(rows, columns))"
]
},
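{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"What the `zip` trick above does: `zip(*pairs)` transposes a list of `(lemma, count)` pairs into two columns, and the outer `zip` stitches the six columns back together as table rows. The same idea on toy data:\n",
"\n",
"```python\n",
"a = [('x', 1), ('y', 2)]\n",
"b = [('p', 9), ('q', 8)]\n",
"\n",
"# zip(*a) -> ('x', 'y'), (1, 2); zip(*b) -> ('p', 'q'), (9, 8)\n",
"rows = list(zip(*zip(*a), *zip(*b)))\n",
"# rows == [('x', 1, 'p', 9), ('y', 2, 'q', 8)]\n",
"```"
]
},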
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# 🛑 End Part 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`<Intentionally Blank>`"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}