Forked from francoiseprovencher/Python tools for NLP.ipynb
Created July 23, 2014 15:02
{ | |
"metadata": { | |
"name": "" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "heading", | |
"level": 1, | |
"metadata": {}, | |
"source": [ | |
"Python tools for NLP\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 4, | |
"metadata": {}, | |
"source": [ | |
"by Fran\u00e7oise Provencher (demo for PyLadies Montreal, July 17th 2014)" | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 2, | |
"metadata": {}, | |
"source": [ | |
"Fantasia festival starts today!" | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"So many films, so little time. Which ones to choose?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"[Check it out! 3 weeks of genre films!](http://fantasiafest.com/2014/en/films-schedule/films) Can we find similar movies by using their synopsis?" | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 2, | |
"metadata": {}, | |
"source": [ | |
"1 - Let's build a corpus of film synopses. We need to scrape the web." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In which we profusely use [Pattern](http://www.clips.ua.ac.be/pages/pattern-web) for its web-friendly features." | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Get all the links pointing to the feature films and their duration" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from pattern.web import URL, download, plaintext, Element, abs, Text\n", | |
"import re" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# The main page of Fantasia fest listing all the films\n", | |
"url = URL('http://fantasiafest.com/2014/en/films-schedule/films')\n", | |
"html = url.download(unicode=True)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#List of links to all the films\n", | |
"element = Element(html)\n", | |
"links=[]\n", | |
"for link in element('h4 a'):\n", | |
" formatted_link = abs(link.attributes.get('href',''), base=url.redirect or url.string)\n", | |
" links.append(formatted_link)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#List of durations\n", | |
"element = Element(html)\n", | |
"duration_pat = re.compile(r'[0-9]* min')\n", | |
"durations=[]\n", | |
"for e in element('div.info ul'):\n", | |
" specs = plaintext(e.content)\n", | |
" duration = int(duration_pat.search(specs).group()[:-4])\n", | |
" durations.append(duration)\n", | |
" print duration" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#List only films with duration over 45 minutes\n", | |
"feature_films = [link for (link, duration) in zip(links,durations) if duration>45]\n", | |
"print feature_films" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"For each feature film, get the synopsis" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Demo for only one film\n", | |
"link = feature_films[165]\n", | |
"html = download(link)\n", | |
"element = Element(html)\n", | |
"title = plaintext(element('h1')[1].content)\n", | |
"synopsis = \"\\n\".join([plaintext(e.content) for e in (element('div.synopsis p'))])\n", | |
"print title\n", | |
"print synopsis" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Use a loop to get all the links\n", | |
"fantasia2014={}\n", | |
"for link in feature_films:\n", | |
" html = download(link)\n", | |
" element = Element(html)\n", | |
" title = plaintext(element('h1')[1].content)\n", | |
" synopsis = \"\\n\".join([plaintext(e.content) for e in (element('div.synopsis p'))])\n", | |
" fantasia2014[title]=synopsis\n", | |
" #print title" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Done! This kind of web-scraping can also be done with other modules such as Requests (for URL requests) and BeautifulSoup (for parsing HTML)" | |
] | |
}, | |
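{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a minimal sketch of the same link scraping done with Requests and BeautifulSoup instead of Pattern. It assumes both packages are installed (`pip install requests beautifulsoup4`); it reuses the same URL and CSS selector as above, and the rest of the notebook keeps using Pattern."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Minimal sketch: the same link extraction with Requests + BeautifulSoup\n",
"# (assumes 'requests' and 'beautifulsoup4' are installed; not used further below)\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from urlparse import urljoin # Python 2 standard library\n",
"\n",
"base = 'http://fantasiafest.com/2014/en/films-schedule/films'\n",
"soup = BeautifulSoup(requests.get(base).text)\n",
"bs_links = [urljoin(base, a['href']) for a in soup.select('h4 a') if a.get('href')]\n",
"print bs_links[:5]"
],
"language": "python",
"metadata": {},
"outputs": []
},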
{ | |
"cell_type": "heading", | |
"level": 2, | |
"metadata": {}, | |
"source": [ | |
"2 - Now that we have the raw text, let's clean it!" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In which we profusely use [NLTK](http://www.nltk.org/) for its classic tokenizer, stemmer and lemmatizer. Check out the awesome free book [Natural Language Processing with Python](http://www.nltk.org/book/)." | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Splitting the text into words : Tokenization" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Fast and dirty : split on whitespace, remove preceding/trailing punctuation\n", | |
"punctuation = u\",.;:'()\\u201c\\u2026\\u201d\\u2013\\u2019\\u2014\"\n", | |
"splitted_text = fantasia2014[\"The Zero Theorem\"].split()\n", | |
"clean_text = [w.strip(punctuation) for w in splitted_text]\n", | |
"print clean_text" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# More sophisticated : using a tokenizer\n", | |
"import nltk\n", | |
"synopsis = fantasia2014[\"The Zero Theorem\"]\n", | |
"tokens = [word for sent in nltk.sent_tokenize(synopsis) for word in nltk.word_tokenize(sent)]\n", | |
"print tokens\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# I prefer my fast and dirty way for this corpus, so let's use a loop to apply it \n", | |
"# it on all the texts\n", | |
"\n", | |
"punctuation = u\",.;:'()\\u201c\\u2026\\u201d\\u2013\\u2019\\u2014\"\n", | |
"fantasia2014_tokenized = dict()\n", | |
"\n", | |
"for title in fantasia2014:\n", | |
" splitted_text = fantasia2014[title].split()\n", | |
" fantasia2014_tokenized[title] = [w.strip(punctuation) for w in splitted_text\n", | |
" if w.strip(punctuation) != \"\"]\n", | |
" \n", | |
"#print fantasia2014_tokenized[\"The Zero Theorem\"]\n", | |
" " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Getting the root of the words : stemming and lemmatization" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Stemming : uses rules to chop off end of words\n", | |
"stemmer = nltk.stem.porter.PorterStemmer()\n", | |
"singular = stemmer.stem(\"zombie\")\n", | |
"plural = stemmer.stem(\"zombies\")\n", | |
"\n", | |
"print singular, plural\n", | |
"print (singular==plural)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Lemmatizing : uses a dictionnary\n", | |
"from nltk import WordNetLemmatizer as wnl\n", | |
"singular = wnl().lemmatize(\"zombie\")\n", | |
"plural = wnl().lemmatize(\"zombies\")\n", | |
"\n", | |
"print singular, plural\n", | |
"print (singular==plural)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# I like the lemmatization better.\n", | |
"# Let's lemmatize all the texts\n", | |
"fantasia2014_lemma = dict()\n", | |
"\n", | |
"for title in fantasia2014_tokenized:\n", | |
" synopsis = []\n", | |
" for word in fantasia2014_tokenized[title]:\n", | |
" lemma= wnl().lemmatize(word.lower()) #lowercasing text is another normalization\n", | |
" synopsis.append(lemma)\n", | |
" fantasia2014_lemma[title] = synopsis\n", | |
" \n", | |
"print fantasia2014_lemma[\"The Zero Theorem\"]\n", | |
" " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Just for fun : stopwords and collocations" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Collocations are frequent bigrams (pair of words) that occur often together\n", | |
"# Get the collocations\n", | |
"all_texts = []\n", | |
"for title in fantasia2014_lemma:\n", | |
" all_texts.extend(fantasia2014_lemma[title])\n", | |
" \n", | |
"bigrams = nltk.collocations.BigramAssocMeasures()\n", | |
"finder = nltk.collocations.BigramCollocationFinder.from_words(all_texts)\n", | |
"scored = finder.score_ngrams(bigrams.likelihood_ratio)\n", | |
"\n", | |
"print scored" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Lets remove the stopwords (a, the, in, into, on ...) and try again\n", | |
"\n", | |
"stop = nltk.corpus.stopwords.words('english') #list of stopwords from NLTK\n", | |
"fantasia2014_stop=dict()\n", | |
"\n", | |
"for title in fantasia2014_lemma:\n", | |
" fantasia2014_stop[title] = [w for w in fantasia2014_lemma[title] if w not in stop]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#This is the same as above, but with the stopwords removed\n", | |
"all_texts = []\n", | |
"for title in fantasia2014_stop:\n", | |
" all_texts.extend(fantasia2014_stop[title])\n", | |
" \n", | |
"bigrams = nltk.collocations.BigramAssocMeasures()\n", | |
"finder = nltk.collocations.BigramCollocationFinder.from_words(all_texts)\n", | |
"scored = finder.score_ngrams(bigrams.likelihood_ratio)\n", | |
"\n", | |
"print scored" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"NLTK has a lot more to offer : part-of-speech tagging, etc. Have a look to see if it's the right fit for you!" | |
] | |
}, | |
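{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick taste, here is a small sketch of part-of-speech tagging on the first few tokens of one synopsis. It assumes the tagger data has already been fetched with `nltk.download()`; nothing below depends on it."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sketch: part-of-speech tagging with NLTK's default tagger\n",
"# (assumes the tagger models have been downloaded via nltk.download())\n",
"tagged = nltk.pos_tag(fantasia2014_tokenized[\"The Zero Theorem\"][:20])\n",
"print tagged"
],
"language": "python",
"metadata": {},
"outputs": []
},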
{ | |
"cell_type": "heading", | |
"level": 2, | |
"metadata": {}, | |
"source": [ | |
"3 - Now that we have a clean corpus, let's train a linguistic model to find similarity between documents" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In which we profusely use [Gensim](http://radimrehurek.com/gensim/index.html), which is great for topic modeling. Also check-out the awesome blog of the developer (that guy took Google's word2vec C code and made it faster in Python). The following is an adaptation of the tutorials [found here](http://radimrehurek.com/gensim/tutorial.html), please refer to them for more explanations." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from gensim import corpora, models, similarities\n", | |
"\n", | |
"#put the text in the right format : lists\n", | |
"titles=[]\n", | |
"texts=[]\n", | |
"for title in fantasia2014_stop:\n", | |
" titles.append(title)\n", | |
" texts.append(fantasia2014_stop[title])\n", | |
" \n", | |
"#remove words that occur only once to reduce the size\n", | |
"all_tokens = sum(texts, [])\n", | |
"tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)\n", | |
"texts = [[word for word in text if word not in tokens_once]\n", | |
" for text in texts]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Build a model (TF-IDF)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Term frequency\u2013inverse document frequency [(TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) gives the importance of a word in a document, as it is frequent in that document but not very frequent in all the documents taken together." | |
] | |
}, | |
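{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common formulation (gensim's default is essentially this, up to the logarithm base and vector normalization): for a term $t$ in a document $d$,\n",
"\n",
"$$\\mathrm{tfidf}(t,d) = \\mathrm{tf}(t,d) \\times \\log\\frac{N}{\\mathrm{df}(t)}$$\n",
"\n",
"where $\\mathrm{tf}(t,d)$ is the count of $t$ in $d$, $N$ is the total number of documents, and $\\mathrm{df}(t)$ is the number of documents containing $t$."
]
},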
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Build a model\n", | |
"dictionary = corpora.Dictionary(texts)\n", | |
"corpus = [dictionary.doc2bow(text) for text in texts]\n", | |
"\n", | |
"tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model\n", | |
"corpus_tfidf = tfidf[corpus] # step 2 -- apply the transformation to the corpus" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# What does it look like?\n", | |
"for doc in corpus_tfidf:\n", | |
" print(doc)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Topic modeling : Latent sementic indexing" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"TF-IDF is fine, but what if we have 2 documents talking about the same thing but with different words, e.g. \"**Funny zombie movie**\" and \"**comedy of the undead**\"? Well, if all these words appear sometimes together in other documents, they could be assigned to the same **topic** and we could use these topics to find the similarity between documents. [Latent sementic indexing](http://en.wikipedia.org/wiki/Latent_semantic_indexing) uses singular value decomposition (SVD)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=15)\n", | |
"lsi.print_topics(5)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# What does this looks like?\n", | |
"for doc in corpus_lsi:\n", | |
" print doc" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Now let's find a film similar to The Zero Theorem" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Which titles can we play with?\n", | |
"#print titles" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Get the indice of the film we wish to query\n", | |
"ind = titles.index(\"The Zero Theorem\")\n", | |
"\n", | |
"#Transform film synopsis to LSI space\n", | |
"doc = texts[ind]\n", | |
"vec_bow = dictionary.doc2bow(doc)\n", | |
"vec_lsi = lsi[vec_bow] # convert the query to LSI space\n", | |
" \n", | |
"print(vec_lsi)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#transform corpus to LSI space and index it IN RAM!\n", | |
"index = similarities.MatrixSimilarity(lsi[corpus]) \n", | |
"\n", | |
"# perform a similarity query against the corpus and sort them\n", | |
"sims = index[vec_lsi] \n", | |
"sims = sorted(enumerate(sims), key=lambda item: -item[1])\n", | |
"\n", | |
"# print out nicely the first 10 films\n", | |
"for i, (document_num, sim) in enumerate(sims) : # print sorted (document number, similarity score) 2-tuples\n", | |
" print titles[document_num], str(sim)\n", | |
" if i > 10 : break" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Fun with Word2Vec" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Train the model with our corpus\n", | |
"model_w2v = models.Word2Vec(texts, min_count=3)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Query to find the most similar word\n", | |
"model_w2v.most_similar(positive=['horror'], topn=5)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Query the model\n", | |
"this = \"light\"\n", | |
"is_to = \"dark\"\n", | |
"what = \"angel\"\n", | |
"is_to2= model_w2v.most_similar(positive=[is_to, what], negative=[this], topn=3)\n", | |
"\n", | |
"print this+' is to '+is_to+' as '+what+' is to : '\n", | |
"print is_to2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 3, | |
"metadata": {}, | |
"source": [ | |
"Our corpus is too small to get an accurate model. Let's use Google news instead." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Load the model, downloaded from : https://code.google.com/p/word2vec/\n", | |
"model_GN = models.Word2Vec.load_word2vec_format('/Users/francoiseprovencher/Documents/Word2VecBinaries/GoogleNews-vectors-negative300.bin.gz', binary=True)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Query to find the most similar word\n", | |
"model_GN.most_similar(positive=['zombie'], topn=5)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#Query the model\n", | |
"this = \"light\"\n", | |
"is_to = \"dark\"\n", | |
"what = \"angel\"\n", | |
"is_to2= model_GN.most_similar(positive=[is_to, what], negative=[this], topn=3)\n", | |
"\n", | |
"print this+' is to '+is_to+' as '+what+' is to : '\n", | |
"print is_to2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |