{ | |
"metadata": { | |
"name": "tfidf.ipynb" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# The Vector Space Model of text\n", | |
"\n", | |
"We need to start thinking about how to translate collections of texts into quantifiable phenomena. The easiest way to start is to think about word frequencies.\n", | |
"\n", | |
"I'm going to try and stay away from NLTK and Scikits-Learn for as much as I can. Let's nail down some basic concepts in Python first." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Basic term frequencies\n", | |
"\n", | |
"First, let's review how to get a count of terms per document: a term frequency vector.\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"#examples taken from here: http://stackoverflow.com/a/1750187\n", | |
"\n", | |
"mydoclist = ['Julie loves me more than Linda loves me',\n", | |
"'Jane likes me more than Julie loves me',\n", | |
"'He likes basketball more than baseball']\n", | |
"\n", | |
"#mydoclist = ['sun sky bright', 'sun sun bright']\n", | |
"\n", | |
"from collections import Counter\n", | |
"\n", | |
"for doc in mydoclist:\n", | |
" tf = Counter()\n", | |
" for word in doc.split():\n", | |
" tf[word] +=1\n", | |
" print tf.items()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]\n", | |
"[('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]\n", | |
"[('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]\n" | |
] | |
} | |
], | |
"prompt_number": 11 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here, we've introduced a new Python object called a Counter. Counters are only in Python 2.7 and higher. They're neat because they allow you to perform this exact kind of function; counting things in a loop.\n", | |
"\n", | |
"Let's call this a first stab at representing documents quantitatively, just by their word counts. But for those of you who are already tipped off by the \"vector\" in the vector space model, these aren't really comparable. This is because they're not in the same *vocabulary space*.\n", | |
"\n", | |
"What we really want is for every document to be the same length, where length is determined by the total vocabulary of our corpus." | |
] | |
}, | |
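    { | |
     "cell_type": "markdown", | |
     "metadata": {}, | |
     "source": [ | |
      "As a quick aside, `Counter` can also consume an iterable directly, so the counting loop above collapses to a one-liner. A minimal sketch (same documents, same counts):" | |
     ] | |
    }, | |
    { | |
     "cell_type": "code", | |
     "collapsed": false, | |
     "input": [ | |
      "from collections import Counter\n", | |
      "\n", | |
      "# equivalent to the explicit loop above: Counter counts the items of any iterable\n", | |
      "for doc in mydoclist:\n", | |
      " print Counter(doc.split()).items()" | |
     ], | |
     "language": "python", | |
     "metadata": {}, | |
     "outputs": [] | |
    }, | |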
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import string #allows for format()\n", | |
" \n", | |
"def build_lexicon(corpus):\n", | |
" lexicon = set()\n", | |
" for doc in corpus:\n", | |
" lexicon.update([word for word in doc.split()])\n", | |
" return lexicon\n", | |
"\n", | |
"def tf(term, document):\n", | |
" return freq(term, document)\n", | |
"\n", | |
"def freq(term, document):\n", | |
" return document.split().count(term)\n", | |
"\n", | |
"vocabulary = build_lexicon(mydoclist)\n", | |
"\n", | |
"doc_term_matrix = []\n", | |
"print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'\n", | |
"for doc in mydoclist:\n", | |
" print 'The doc is \"' + doc + '\"'\n", | |
" tf_vector = [tf(word, doc) for word in vocabulary]\n", | |
" tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)\n", | |
" print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)\n", | |
" doc_term_matrix.append(tf_vector)\n", | |
" \n", | |
" # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int...\n", | |
" # try it! type(mydoclist.index(doc) + 1)\n", | |
"\n", | |
"print 'All combined, here is our master document term matrix: '\n", | |
"print doc_term_matrix\n", | |
"\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]\n", | |
"The doc is \"Julie loves me more than Linda loves me\"\n", | |
"The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]\n", | |
"The doc is \"Jane likes me more than Julie loves me\"\n", | |
"The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]\n", | |
"The doc is \"He likes basketball more than baseball\"\n", | |
"The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]\n", | |
"All combined, here is our master document term matrix: \n", | |
"[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]\n" | |
] | |
} | |
], | |
"prompt_number": 12 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Okay, that seems reasonable enough. If any of you have any experience with machine learning, what you've just seen is the creation of a feature space. Now every document is in the same feature space, meaning that we can represent the entire corpus in the same dimensional space without having lost too much information. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Normalizing vectors to L2 Norm = 1" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" \n", | |
"\n", | |
"Once you've got your data in the same feature space, you can start applying some machine learning methods; classifying, clustering, and so on. But actually, we've got a few problems. Words aren't all equally informative.\n", | |
"\n", | |
"If words appear too frequently in a single document, they're going to muck up our analysis. We want to perform some scaling of each of these term frequency vectors into something a bit more representative. In other words, we need to do some **vector normalizing**.\n", | |
"\n", | |
"We don't really have the time to go into the intense math of this. Just accept for now that we need to ensure that the L2 norm of each vector is equal to 1. Here's some code that shows how this is done." | |
] | |
}, | |
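    { | |
     "cell_type": "markdown", | |
     "metadata": {}, | |
     "source": [ | |
      "For reference, the L2 (Euclidean) norm of a vector $v$ is $\\|v\\|_2 = \\sqrt{\\sum_i v_i^2}$, and dividing every element of $v$ by that quantity yields a vector whose L2 norm is 1. That is all the `l2_normalizer` function below does." | |
     ] | |
    }, | |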
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import math\n", | |
"\n", | |
"def l2_normalizer(vec):\n", | |
" denom = np.sum([el**2 for el in vec])\n", | |
" return [(el / math.sqrt(denom)) for el in vec]\n", | |
"\n", | |
"doc_term_matrix_l2 = []\n", | |
"for vec in doc_term_matrix:\n", | |
" doc_term_matrix_l2.append(l2_normalizer(vec))\n", | |
"\n", | |
"print 'A regular old document term matrix: ' \n", | |
"print np.matrix(doc_term_matrix)\n", | |
"print '\\nA document term matrix with row-wise L2 norms of 1:'\n", | |
"print np.matrix(doc_term_matrix_l2)\n", | |
"\n", | |
"# if you want to check this math, perform the following:\n", | |
"# from numpy import linalg as la\n", | |
"# la.norm(doc_term_matrix[0])\n", | |
"# la.norm(doc_term_matrix_l2[0])" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"A regular old document term matrix: \n", | |
"[[2 0 1 0 0 2 0 1 0 1 1]\n", | |
" [2 0 1 0 1 1 1 0 0 1 1]\n", | |
" [0 1 0 1 1 0 0 0 1 1 1]]\n", | |
"\n", | |
"A document term matrix with row-wise L2 norms of 1:\n", | |
"[[ 0.57735027 0. 0.28867513 0. 0. 0.57735027\n", | |
" 0. 0.28867513 0. 0.28867513 0.28867513]\n", | |
" [ 0.63245553 0. 0.31622777 0. 0.31622777 0.31622777\n", | |
" 0.31622777 0. 0. 0.31622777 0.31622777]\n", | |
" [ 0. 0.40824829 0. 0.40824829 0.40824829 0. 0.\n", | |
" 0. 0.40824829 0.40824829 0.40824829]]\n" | |
] | |
} | |
], | |
"prompt_number": 13 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Not bad. Without getting too deeply mired into the linear algebra, you can see immediately that we've scaled down vectors such that each element is between [0, 1], without losing too much valuable information. You see how it's no longer the case that a term count of 1 is the same value in one vector as another?\n", | |
"\n", | |
"Why would we care about this kind of normalizing? Think about it this way; if you wanted to make a document seem more related to a particular topic than it actually was, you might try boosting the likelihood of its inclusion into a topic by repeating the same word over and over and over again. Frankly, at a certain point, we're getting a diminishing return on the informative value of the word. So we need to scale down words that appear too frequently in a document. \n", | |
"\n", | |
"## IDF frequency weighting\n", | |
"\n", | |
"But we're still not there yet. Just as all words aren't equally valuable *within* a document, not all words are valuable across *all documents*. We can try reweighting every word by its **inverse document frequency**. Let's see what's involved in that." | |
] | |
}, | |
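    { | |
     "cell_type": "markdown", | |
     "metadata": {}, | |
     "source": [ | |
      "Concretely, the idf variant computed below (one of several common formulations) is $\\mathrm{idf}(w) = \\ln\\left(\\frac{N}{1 + \\mathrm{df}(w)}\\right)$, where $N$ is the number of documents and $\\mathrm{df}(w)$ is the number of documents that contain $w$. With this variant, a word that shows up in every document ends up with a weight at or below zero." | |
     ] | |
    }, | |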
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"def numDocsContaining(word, doclist):\n", | |
" doccount = 0\n", | |
" for doc in doclist:\n", | |
" if freq(word, doc) > 0:\n", | |
" doccount +=1\n", | |
" return doccount \n", | |
"\n", | |
"def idf(word, doclist):\n", | |
" n_samples = len(doclist)\n", | |
" df = numDocsContaining(word, doclist)\n", | |
" return np.log(n_samples / 1+df)\n", | |
"\n", | |
"my_idf_vector = [idf(word, mydoclist) for word in vocabulary]\n", | |
"\n", | |
"print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'\n", | |
"print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]\n", | |
"The inverse document frequency vector is [1.609438, 1.386294, 1.609438, 1.386294, 1.609438, 1.609438, 1.386294, 1.386294, 1.386294, 1.791759, 1.791759]\n" | |
] | |
} | |
], | |
"prompt_number": 14 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we have a general sense of information values per term in our vocabulary, accounting for their relative frequency across the entire corpus. Recall that this is an inverse! The more negative a term, the more frequent it is.\n", | |
"\n", | |
"We're almost there. To get TF-IDF weighted word vectors, you have to perform the simple calculation of tf * idf. \n", | |
"\n", | |
"Time to take a step back. Recall from linear algebra that if you multiply a vector of A x B by a vector of A x B, you're going to get a vector of size A x A, or a scalar. This won't do, since what we want is a term vector of the same dimensions (1 x number of terms), where each element has been scaled by this idf weighting. How do we do that in Python?\n", | |
"\n", | |
"We could write the whole function out here, but instead we're going to show a brief introduction into `numpy`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import numpy as np\n", | |
"\n", | |
"def build_idf_matrix(idf_vector):\n", | |
" idf_mat = np.zeros((len(idf_vector), len(idf_vector)))\n", | |
" np.fill_diagonal(idf_mat, idf_vector)\n", | |
" return idf_mat\n", | |
"\n", | |
"my_idf_matrix = build_idf_matrix(my_idf_vector)\n", | |
"\n", | |
"#print my_idf_matrix" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 15 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Awesome! Now we have converted our IDF vector into a matrix of size BxB, where the diagonal is the IDF vector. That means we can perform now multiply every term frequency vector by the inverse document frequency matrix. Then to make sure we are also accounting for words that appear too frequently within documents, we'll normalize each document such that the L2 norm = 1. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"doc_term_matrix_tfidf = []\n", | |
"\n", | |
"#performing tf-idf matrix multiplication\n", | |
"for tf_vector in doc_term_matrix:\n", | |
" doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))\n", | |
"\n", | |
"#normalizing\n", | |
"doc_term_matrix_tfidf_l2 = []\n", | |
"for tf_vector in doc_term_matrix_tfidf:\n", | |
" doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))\n", | |
" \n", | |
"print vocabulary\n", | |
"print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])\n", | |
"[[ 0.57211257 0. 0.28605628 0. 0. 0.57211257\n", | |
" 0. 0.24639547 0. 0.31846153 0.31846153]\n", | |
" [ 0.62558902 0. 0.31279451 0. 0.31279451 0.31279451\n", | |
" 0.26942653 0. 0. 0.34822873 0.34822873]\n", | |
" [ 0. 0.36063612 0. 0.36063612 0.41868557 0. 0.\n", | |
" 0. 0.36063612 0.46611542 0.46611542]]\n" | |
] | |
} | |
], | |
"prompt_number": 16 | |
}, | |
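    { | |
     "cell_type": "markdown", | |
     "metadata": {}, | |
     "source": [ | |
      "As an aside, the diagonal-matrix multiplication above is just a convenient way of scaling each term frequency elementwise by its idf value. Here's a minimal sketch (using only the objects we already built) showing the equivalent elementwise computation:" | |
     ] | |
    }, | |
    { | |
     "cell_type": "code", | |
     "collapsed": false, | |
     "input": [ | |
      "# elementwise tf * idf, then L2 normalization; should match the diagonal-matrix version\n", | |
      "tfidf_alt = [l2_normalizer(np.array(tf_vector) * np.array(my_idf_vector)) for tf_vector in doc_term_matrix]\n", | |
      "print np.allclose(tfidf_alt, doc_term_matrix_tfidf_l2)" | |
     ], | |
     "language": "python", | |
     "metadata": {}, | |
     "outputs": [] | |
    }, | |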
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Awesome! You've just seen an example of how to tediously construct a TF-IDF weighted document term matrix. \n", | |
"\n", | |
"Here comes the best part: you don't even have to do this by hand. \n", | |
"\n", | |
"Remember that everything in Python is an object, objects take up memory (and performing actions take up time). Using scikits-learn ensures that you don't have to worry about the efficiency of all the previous steps.\n", | |
"\n", | |
"**NOTE**: The values you get from the `TfidfVectorizer/TfidfTransformer` will be different than what we have computed by hand. This is because scikits-learn uses an adapted version of Tfidf to deal with divide-by-zero errors. There is a more in-depth discussion [here](http://stackoverflow.com/a/18692538)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"\n", | |
"count_vectorizer = CountVectorizer(min_df=1)\n", | |
"term_freq_matrix = count_vectorizer.fit_transform(mydoclist)\n", | |
"print \"Vocabulary:\", count_vectorizer.vocabulary_\n", | |
"\n", | |
"from sklearn.feature_extraction.text import TfidfTransformer\n", | |
"\n", | |
"tfidf = TfidfTransformer(norm=\"l2\")\n", | |
"tfidf.fit(term_freq_matrix)\n", | |
"\n", | |
"tf_idf_matrix = tfidf.transform(term_freq_matrix)\n", | |
"print tf_idf_matrix.todense()\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}\n", | |
"[[ 0. 0. 0. 0. 0.28945906 0.\n", | |
" 0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]\n", | |
" [ 0. 0. 0. 0.41715759 0.3172591 0.3172591\n", | |
" 0. 0.3172591 0.6345182 0.24637999 0.24637999]\n", | |
" [ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358\n", | |
" 0. 0. 0. 0.28561676 0.28561676]]\n" | |
] | |
} | |
], | |
"prompt_number": 17 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In fact, you can do this just by combining the steps into one: the TfidfVectorizer" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from sklearn.feature_extraction.text import TfidfVectorizer\n", | |
"\n", | |
"tfidf_vectorizer = TfidfVectorizer(min_df = 1)\n", | |
"tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)\n", | |
"\n", | |
"print tfidf_matrix.todense()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"[[ 0. 0. 0. 0. 0.28945906 0.\n", | |
" 0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]\n", | |
" [ 0. 0. 0. 0.41715759 0.3172591 0.3172591\n", | |
" 0. 0.3172591 0.6345182 0.24637999 0.24637999]\n", | |
" [ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358\n", | |
" 0. 0. 0. 0.28561676 0.28561676]]\n" | |
] | |
} | |
], | |
"prompt_number": 18 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"And we can fit new observations into this vocabulary space like so:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']\n", | |
"new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)\n", | |
"print tfidf_vectorizer.vocabulary_\n", | |
"print new_term_freq_matrix.todense()\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}\n", | |
"[[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0.\n", | |
" 0. 0. 0. 0. ]\n", | |
" [ 0. 0.68091856 0. 0. 0.51785612 0.51785612\n", | |
" 0. 0. 0. 0. 0. ]\n", | |
" [ 0.62276601 0. 0. 0.62276601 0. 0. 0.\n", | |
" 0.4736296 0. 0. 0. ]]\n" | |
] | |
} | |
], | |
"prompt_number": 19 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note that we didn't get words like 'watches' in the `new_term_freq_matrix`. That's because we *trained* the object on the documents in `mydoclist,` and that word doesn't appear in the vocabulary from that corpus. In other words, it's out of the lexicon." | |
] | |
}, | |
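    { | |
     "cell_type": "markdown", | |
     "metadata": {}, | |
     "source": [ | |
      "As a quick sanity check (a minimal sketch using the vectorizer we just fit), we can ask the fitted `vocabulary_` dict directly whether a term was seen during fitting:" | |
     ] | |
    }, | |
    { | |
     "cell_type": "code", | |
     "collapsed": false, | |
     "input": [ | |
      "# terms outside the training vocabulary are simply ignored at transform time\n", | |
      "print 'watches' in tfidf_vectorizer.vocabulary_\n", | |
      "print 'basketball' in tfidf_vectorizer.vocabulary_" | |
     ], | |
     "language": "python", | |
     "metadata": {}, | |
     "outputs": [] | |
    }, | |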
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Back to the Amazon review texts!\n", | |
"### Exercise 2\n", | |
"\n", | |
"Now it's time for you to try applying what you've learned. Try building a TF-IDF weighted document term matrix from the list of Amazon `review_text` strings using the `TfidfVectorizer`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import os\n", | |
"import csv\n", | |
"\n", | |
"#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')\n", | |
"\n", | |
"with open('amazon/sociology_2010.csv', 'rb') as csvfile:\n", | |
" amazon_reader = csv.DictReader(csvfile, delimiter=',')\n", | |
" amazon_reviews = [row['review_text'] for row in amazon_reader]\n", | |
"\n", | |
" #your code here!!!" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 10 | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |