{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Python tools for NLP\n",
" "
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"by Fran\u00e7oise Provencher (demo for PyLadies Montreal, July 17th 2014)"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Fantasia festival starts today!"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"So many films, so little time. Which ones to choose?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Check it out! 3 weeks of genre films!](http://fantasiafest.com/2014/en/films-schedule/films) Can we find similar movies by using their synopsis?"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"1 - Let's build a corpus of film synopses. We need to scrape the web."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In which we profusely use [Pattern](http://www.clips.ua.ac.be/pages/pattern-web) for its web-friendly features."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Get all the links pointing to the feature films and their duration"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from pattern.web import URL, download, plaintext, Element, abs, Text\n",
"import re"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# The main page of Fantasia fest listing all the films\n",
"url = URL('http://fantasiafest.com/2014/en/films-schedule/films')\n",
"html = url.download(unicode=True)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#List of links to all the films\n",
"element = Element(html)\n",
"links=[]\n",
"for link in element('h4 a'):\n",
" formatted_link = abs(link.attributes.get('href',''), base=url.redirect or url.string)\n",
" links.append(formatted_link)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#List of durations\n",
"element = Element(html)\n",
"duration_pat = re.compile(r'[0-9]* min')\n",
"durations=[]\n",
"for e in element('div.info ul'):\n",
" specs = plaintext(e.content)\n",
" duration = int(duration_pat.search(specs).group()[:-4])\n",
" durations.append(duration)\n",
" print duration"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#List only films with duration over 45 minutes\n",
"feature_films = [link for (link, duration) in zip(links,durations) if duration>45]\n",
"print feature_films"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"For each feature film, get the synopsis"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Demo for only one film\n",
"link = feature_films[165]\n",
"html = download(link)\n",
"element = Element(html)\n",
"title = plaintext(element('h1')[1].content)\n",
"synopsis = \"\\n\".join([plaintext(e.content) for e in (element('div.synopsis p'))])\n",
"print title\n",
"print synopsis"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Use a loop to get all the links\n",
"fantasia2014={}\n",
"for link in feature_films:\n",
" html = download(link)\n",
" element = Element(html)\n",
" title = plaintext(element('h1')[1].content)\n",
" synopsis = \"\\n\".join([plaintext(e.content) for e in (element('div.synopsis p'))])\n",
" fantasia2014[title]=synopsis\n",
" #print title"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Done! This kind of web-scraping can also be done with other modules such as Requests (for URL requests) and BeautifulSoup (for parsing HTML)"
]
},
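{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a rough sketch of the same link extraction done with Requests and BeautifulSoup instead of Pattern. This is not the approach used in the rest of this notebook; it assumes the `requests` and `beautifulsoup4` packages are installed."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sketch: equivalent scraping with Requests + BeautifulSoup (assumes both packages are installed)\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"response = requests.get('http://fantasiafest.com/2014/en/films-schedule/films')\n",
"soup = BeautifulSoup(response.text, 'html.parser')\n",
"\n",
"# same CSS-style selection as element('h4 a') in Pattern\n",
"# note: these may be relative URLs; Pattern's abs() above converts them to absolute ones\n",
"bs_links = [a.get('href', '') for a in soup.select('h4 a')]\n",
"print bs_links[:5]"
],
"language": "python",
"metadata": {},
"outputs": []
},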
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"2 - Now that we have the raw text, let's clean it!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In which we profusely use [NLTK](http://www.nltk.org/) for its classic tokenizer, stemmer and lemmatizer. Check out the awesome free book [Natural Language Processing with Python](http://www.nltk.org/book/)."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Splitting the text into words : Tokenization"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Fast and dirty : split on whitespace, remove preceding/trailing punctuation\n",
"punctuation = u\",.;:'()\\u201c\\u2026\\u201d\\u2013\\u2019\\u2014\"\n",
"splitted_text = fantasia2014[\"The Zero Theorem\"].split()\n",
"clean_text = [w.strip(punctuation) for w in splitted_text]\n",
"print clean_text"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# More sophisticated : using a tokenizer\n",
"import nltk\n",
"synopsis = fantasia2014[\"The Zero Theorem\"]\n",
"tokens = [word for sent in nltk.sent_tokenize(synopsis) for word in nltk.word_tokenize(sent)]\n",
"print tokens\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# I prefer my fast and dirty way for this corpus, so let's use a loop to apply it \n",
"# it on all the texts\n",
"\n",
"punctuation = u\",.;:'()\\u201c\\u2026\\u201d\\u2013\\u2019\\u2014\"\n",
"fantasia2014_tokenized = dict()\n",
"\n",
"for title in fantasia2014:\n",
" splitted_text = fantasia2014[title].split()\n",
" fantasia2014_tokenized[title] = [w.strip(punctuation) for w in splitted_text\n",
" if w.strip(punctuation) != \"\"]\n",
" \n",
"#print fantasia2014_tokenized[\"The Zero Theorem\"]\n",
" "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Getting the root of the words : stemming and lemmatization"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Stemming : uses rules to chop off end of words\n",
"stemmer = nltk.stem.porter.PorterStemmer()\n",
"singular = stemmer.stem(\"zombie\")\n",
"plural = stemmer.stem(\"zombies\")\n",
"\n",
"print singular, plural\n",
"print (singular==plural)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Lemmatizing : uses a dictionnary\n",
"from nltk import WordNetLemmatizer as wnl\n",
"singular = wnl().lemmatize(\"zombie\")\n",
"plural = wnl().lemmatize(\"zombies\")\n",
"\n",
"print singular, plural\n",
"print (singular==plural)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# I like the lemmatization better.\n",
"# Let's lemmatize all the texts\n",
"fantasia2014_lemma = dict()\n",
"\n",
"for title in fantasia2014_tokenized:\n",
" synopsis = []\n",
" for word in fantasia2014_tokenized[title]:\n",
" lemma= wnl().lemmatize(word.lower()) #lowercasing text is another normalization\n",
" synopsis.append(lemma)\n",
" fantasia2014_lemma[title] = synopsis\n",
" \n",
"print fantasia2014_lemma[\"The Zero Theorem\"]\n",
" "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Just for fun : stopwords and collocations"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Collocations are frequent bigrams (pair of words) that occur often together\n",
"# Get the collocations\n",
"all_texts = []\n",
"for title in fantasia2014_lemma:\n",
" all_texts.extend(fantasia2014_lemma[title])\n",
" \n",
"bigrams = nltk.collocations.BigramAssocMeasures()\n",
"finder = nltk.collocations.BigramCollocationFinder.from_words(all_texts)\n",
"scored = finder.score_ngrams(bigrams.likelihood_ratio)\n",
"\n",
"print scored"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Lets remove the stopwords (a, the, in, into, on ...) and try again\n",
"\n",
"stop = nltk.corpus.stopwords.words('english') #list of stopwords from NLTK\n",
"fantasia2014_stop=dict()\n",
"\n",
"for title in fantasia2014_lemma:\n",
" fantasia2014_stop[title] = [w for w in fantasia2014_lemma[title] if w not in stop]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#This is the same as above, but with the stopwords removed\n",
"all_texts = []\n",
"for title in fantasia2014_stop:\n",
" all_texts.extend(fantasia2014_stop[title])\n",
" \n",
"bigrams = nltk.collocations.BigramAssocMeasures()\n",
"finder = nltk.collocations.BigramCollocationFinder.from_words(all_texts)\n",
"scored = finder.score_ngrams(bigrams.likelihood_ratio)\n",
"\n",
"print scored"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"NLTK has a lot more to offer : part-of-speech tagging, etc. Have a look to see if it's the right fit for you!"
]
},
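{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick taste of what else NLTK offers, here is a small part-of-speech tagging sketch on a made-up sentence (not part of the corpus). It assumes the tagger data has already been fetched with `nltk.download()`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sketch: part-of-speech tagging on a made-up sentence (requires the NLTK tagger data)\n",
"sample = nltk.word_tokenize(\"The undead shamble through a quiet suburb at dawn\")\n",
"print nltk.pos_tag(sample)"
],
"language": "python",
"metadata": {},
"outputs": []
},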
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"3 - Now that we have a clean corpus, let's train a linguistic model to find similarity between documents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In which we profusely use [Gensim](http://radimrehurek.com/gensim/index.html), which is great for topic modeling. Also check-out the awesome blog of the developer (that guy took Google's word2vec C code and made it faster in Python). The following is an adaptation of the tutorials [found here](http://radimrehurek.com/gensim/tutorial.html), please refer to them for more explanations."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from gensim import corpora, models, similarities\n",
"\n",
"#put the text in the right format : lists\n",
"titles=[]\n",
"texts=[]\n",
"for title in fantasia2014_stop:\n",
" titles.append(title)\n",
" texts.append(fantasia2014_stop[title])\n",
" \n",
"#remove words that occur only once to reduce the size\n",
"all_tokens = sum(texts, [])\n",
"tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)\n",
"texts = [[word for word in text if word not in tokens_once]\n",
" for text in texts]"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Build a model (TF-IDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Term frequency\u2013inverse document frequency [(TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) gives the importance of a word in a document, as it is frequent in that document but not very frequent in all the documents taken together."
]
},
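{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea concrete, here is a toy computation on a made-up three-document corpus (not the festival data), using the basic tf-idf formula; Gensim's weighting below differs slightly in its normalization."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Toy tf-idf sketch on a hypothetical mini-corpus (not the festival synopses)\n",
"import math\n",
"\n",
"toy_docs = [[\"zombie\", \"zombie\", \"movie\"], [\"romance\", \"movie\"], [\"space\", \"movie\"]]\n",
"\n",
"def toy_tfidf(term, doc, docs):\n",
"    tf = doc.count(term) / float(len(doc))  # frequency of the term in this document\n",
"    df = sum(1 for d in docs if term in d)  # number of documents containing the term\n",
"    idf = math.log(float(len(docs)) / df)   # rarer terms get a larger idf\n",
"    return tf * idf\n",
"\n",
"print toy_tfidf(\"zombie\", toy_docs[0], toy_docs)  # high: frequent here, rare elsewhere\n",
"print toy_tfidf(\"movie\", toy_docs[0], toy_docs)   # zero: appears in every document"
],
"language": "python",
"metadata": {},
"outputs": []
},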
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Build a model\n",
"dictionary = corpora.Dictionary(texts)\n",
"corpus = [dictionary.doc2bow(text) for text in texts]\n",
"\n",
"tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model\n",
"corpus_tfidf = tfidf[corpus] # step 2 -- apply the transformation to the corpus"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# What does it look like?\n",
"for doc in corpus_tfidf:\n",
" print(doc)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Topic modeling : Latent sementic indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TF-IDF is fine, but what if we have 2 documents talking about the same thing but with different words, e.g. \"**Funny zombie movie**\" and \"**comedy of the undead**\"? Well, if all these words appear sometimes together in other documents, they could be assigned to the same **topic** and we could use these topics to find the similarity between documents. [Latent sementic indexing](http://en.wikipedia.org/wiki/Latent_semantic_indexing) uses singular value decomposition (SVD)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=15)\n",
"lsi.print_topics(5)"
],
"language": "python",
"metadata": {},
"outputs": []
},
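{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the intuition above, here is a sketch that folds two made-up snippets, \"funny zombie movie\" and \"comedy undead\", into the LSI space and compares them with cosine similarity. Words that are not in the festival vocabulary are simply ignored by `doc2bow`, so the exact score depends on the corpus."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sketch: two hypothetical snippets with no words in common can still be close in topic space\n",
"from gensim import matutils\n",
"\n",
"doc_a = dictionary.doc2bow([\"funny\", \"zombie\", \"movie\"])\n",
"doc_b = dictionary.doc2bow([\"comedy\", \"undead\"])\n",
"\n",
"vec_a = lsi[tfidf[doc_a]] # fold each snippet into LSI space\n",
"vec_b = lsi[tfidf[doc_b]]\n",
"\n",
"print matutils.cossim(vec_a, vec_b)"
],
"language": "python",
"metadata": {},
"outputs": []
},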
{
"cell_type": "code",
"collapsed": false,
"input": [
"# What does this looks like?\n",
"for doc in corpus_lsi:\n",
" print doc"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Now let's find a film similar to The Zero Theorem"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Which titles can we play with?\n",
"#print titles"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Get the indice of the film we wish to query\n",
"ind = titles.index(\"The Zero Theorem\")\n",
"\n",
"#Transform film synopsis to LSI space\n",
"doc = texts[ind]\n",
"vec_bow = dictionary.doc2bow(doc)\n",
"vec_lsi = lsi[vec_bow] # convert the query to LSI space\n",
" \n",
"print(vec_lsi)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#transform corpus to LSI space and index it IN RAM!\n",
"index = similarities.MatrixSimilarity(lsi[corpus]) \n",
"\n",
"# perform a similarity query against the corpus and sort them\n",
"sims = index[vec_lsi] \n",
"sims = sorted(enumerate(sims), key=lambda item: -item[1])\n",
"\n",
"# print out nicely the first 10 films\n",
"for i, (document_num, sim) in enumerate(sims) : # print sorted (document number, similarity score) 2-tuples\n",
" print titles[document_num], str(sim)\n",
" if i > 10 : break"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Fun with Word2Vec"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Train the model with our corpus\n",
"model_w2v = models.Word2Vec(texts, min_count=3)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Query to find the most similar word\n",
"model_w2v.most_similar(positive=['horror'], topn=5)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Query the model\n",
"this = \"light\"\n",
"is_to = \"dark\"\n",
"what = \"angel\"\n",
"is_to2= model_w2v.most_similar(positive=[is_to, what], negative=[this], topn=3)\n",
"\n",
"print this+' is to '+is_to+' as '+what+' is to : '\n",
"print is_to2"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Our corpus is too small to get an accurate model. Let's use Google news instead."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Load the model, downloaded from : https://code.google.com/p/word2vec/\n",
"model_GN = models.Word2Vec.load_word2vec_format('/Users/francoiseprovencher/Documents/Word2VecBinaries/GoogleNews-vectors-negative300.bin.gz', binary=True)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Query to find the most similar word\n",
"model_GN.most_similar(positive=['zombie'], topn=5)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Query the model\n",
"this = \"light\"\n",
"is_to = \"dark\"\n",
"what = \"angel\"\n",
"is_to2= model_GN.most_similar(positive=[is_to, what], negative=[this], topn=3)\n",
"\n",
"print this+' is to '+is_to+' as '+what+' is to : '\n",
"print is_to2"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}