textanalysis
{
"metadata": {
"name": "textanalysis.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic Text Analysis\n",
"\n",
"We're going to cover two areas of text analysis: computing similarity between documents and training a very simple text classification." | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Similarity metrics\n",
"\n",
"An early exploratory level analysis that you might want to perform is *similarity*. How similar are your documents to each other? Once you are operating under the vector space model, you can compute different types of distances between word vectors; Euclidean, Hamming, Jaccard, cosine, and so forth. \n", | |
"\n", | |
"It's currently quite popular to use cosine distances, so let's review what a cosine distance is and how to compute it.\n", | |
"\n", | |
"### Cosine distance\n", | |
"\n", | |
"Think back to your days in grade school geometry. Cosine is a measure of an angle between two vectors. If the vectors are heading towards points very distant from the other, the angle between the two vectors will be very large. If they are heading towards a similar point in space, the angle will be very small.\n", | |
"\n", | |
"Since we have represented documents as word vectors, we can find the cosine distance between documents and treat the resulting angle as an estimate of similarity based on word occurrences and frequencies. Similarity tends to be recalculated as `1 - distance` so a cosine score of 1.0 means exactly identical.\n", | |
"\n", | |
"Here are some excellent, more in-depth summaries of the math behind cosine distances as they relate to text in vector space:\n", | |
" \n", | |
"- [Justin Grimmer's Text as Data class](http://stanford.edu/~jgrimmer/tc5.pdf)\n", | |
"- [Stanford's NLP group on cosine scores](http://nlp.stanford.edu/IR-book/html/htmledition/computing-vector-scores-1.html)\n", | |
" \n", | |
"We could go into more detail about how to compute cosine distance, but luckily scikits-learn has made this easy for us. We can multiply the tfidf sparse matrix object by its transpose and get cosine distances." | |
] | |
}, | |
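{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before handing things off to scikit-learn, here is a minimal sketch of the formula itself, using plain numpy and two tiny made-up count vectors (the vectors below are illustrative only, not part of any real data): the cosine similarity of two vectors is their dot product divided by the product of their lengths."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#a sketch of the cosine similarity formula with two made-up count vectors\n",
"import numpy as np\n",
"\n",
"a = np.array([2.0, 1.0, 0.0, 1.0]) # hypothetical word counts for document A\n",
"b = np.array([1.0, 1.0, 1.0, 0.0]) # hypothetical word counts for document B\n",
"\n",
"#cos(theta) = (a . b) / (||a|| * ||b||)\n",
"cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
"\n",
"print 'cosine similarity: ' + str(cos_sim)\n",
"print 'cosine distance: ' + str(1 - cos_sim)"
],
"language": "python",
"metadata": {},
"outputs": []
},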
{
"cell_type": "code",
"collapsed": false,
"input": [
"#according to one of the guys who helped write the tfidf implementation in scikit-learn: http://stackoverflow.com/a/8897648\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"mydoclist = ['Julie loves me more than Linda loves me',\n",
"'Jane likes me more than Julie loves me',\n",
"'He likes basketball more than baseball']\n",
"\n",
"tfidf_vectorizer = TfidfVectorizer(min_df = 1)\n",
"tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)\n",
"\n",
"document_distances = (tfidf_matrix * tfidf_matrix.T)\n", | |
"print 'Created a ' + str(document_distances.get_shape()[0]) + ' by ' + str(document_distances.get_shape()[1]) + ' document-document cosine distance matrix.'\n", | |
"print document_distances.toarray()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"Created a 3 by 3 document-document cosine distance matrix.\n", | |
"[[ 1. 0.75360253 0.12840803]\n", | |
" [ 0.75360253 1. 0.2574232 ]\n", | |
" [ 0.12840803 0.2574232 1. ]]\n" | |
] | |
} | |
], | |
"prompt_number": 1 | |
}, | |
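{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want cosine *distances* rather than similarities (for example, to feed into something that expects a distance), one quick sketch is simply to subtract the similarity matrix from 1:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#cosine distance = 1 - cosine similarity (a quick sketch reusing tfidf_matrix from above)\n",
"print 1 - (tfidf_matrix * tfidf_matrix.T).toarray()"
],
"language": "python",
"metadata": {},
"outputs": []
},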
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding similar documents\n",
"\n",
"So now you have documents as word vectors, and you want to find the ones that are most similar to each other.\n",
"\n",
"Here's a tip: the cosine similarity between two vectors is a reasonably easy way to perform a first test of document similarity, and, as we saw above, it falls straight out of the tf-idf representation.\n",
"\n",
"Since we have represented documents as tf-idf weighted word vectors, we can rank every document by its cosine similarity to a query document and read off the most to least similar.\n",
"\n",
"This is how cheating detection works, in a nutshell."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#code adapted from here: http://stackoverflow.com/a/12128777\n",
"from sklearn.metrics.pairwise import linear_kernel\n",
"\n",
"#the linear kernel is the same as cosine similarity when the tf-idf vectors are Euclidean-normalized (L2 norm = 1)\n",
"#this is the benefit of sticking with scikit-learn from the beginning to the end of an analysis\n",
"\n",
"cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten() # similarity of every document to the very first document\n",
"related_docs_indices = cosine_similarities.argsort()[:-len(mydoclist)-1:-1] # indices ordered from most to least similar\n",
"\n",
"print mydoclist\n",
"print cosine_similarities[related_docs_indices] # what are the cosine similarities?"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['Julie loves me more than Linda loves me', 'Jane likes me more than Julie loves me', 'He likes basketball more than baseball']\n",
"[ 1. 0.75360253 0.12840803]\n"
]
}
],
"prompt_number": 2
},
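{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a toy illustration of the cheating-detection idea, here is a minimal sketch that flags every pair of documents whose cosine similarity exceeds a threshold. It reuses `tfidf_matrix` and `mydoclist` from above, and the threshold of 0.5 is arbitrary, chosen only for this example."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#flag suspiciously similar pairs of documents (a sketch; 0.5 is an arbitrary threshold)\n",
"similarities = (tfidf_matrix * tfidf_matrix.T).toarray()\n",
"threshold = 0.5\n",
"\n",
"for i in range(len(mydoclist)):\n",
"    for j in range(i + 1, len(mydoclist)):\n",
"        if similarities[i, j] > threshold:\n",
"            print 'Documents %d and %d look similar (cosine similarity = %.3f)' % (i, j, similarities[i, j])"
],
"language": "python",
"metadata": {},
"outputs": []
},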
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classifying documents\n",
"\n",
"First, let's create a list of dictionaries, one per review, each of the form `{label: review_text}`, where the label comes from the name of the source CSV file."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"import csv\n",
"#os.chdir('/path/to/wherever/you/downloaded/data/from/textcleaning')\n",
"os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text2/extra/amazon')\n",
"\n",
"amazon_reviews = []\n",
"target_labels = []\n",
"\n",
"#each CSV file holds the reviews for one category; the filename (minus the extension) becomes the label\n",
"for infile in os.listdir(os.getcwd()):\n",
"    if infile.endswith('csv'):\n",
"        label = infile.split('.')[0]\n",
"        target_labels.append(label)\n",
"\n",
"        with open(infile, 'rb') as csvfile:\n",
"            amazon_reader = csv.DictReader(csvfile, delimiter=',')\n",
"            infile_rows = [{ label: row['review_text'] } for row in amazon_reader]\n",
"\n",
"        for doc in infile_rows:\n",
"            amazon_reviews.append(doc)\n",
"\n",
"print 'There are ' + str(len(amazon_reviews)) + ' total reviews.'\n",
"\n",
"print 'The labels are ' + ', '.join(target_labels) + '.'"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"There are 5522 total reviews.\n",
"The labels are biologicalsciences_2010, literature_2010, sociology_2010.\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we have a key-value mapping of label-body text for all the Amazon reviews. Let's look into building a classifier based on this text."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#first, we need to shuffle the documents into a random order\n",
"#this makes it easy to split them into train and test sets\n",
"\n",
"from random import shuffle\n",
"x = list(amazon_reviews) # shallow copy so the original list keeps its order\n",
"shuffle(x)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn import metrics\n",
"from operator import itemgetter\n",
"\n",
"trainset_size = int(round(len(amazon_reviews)*0.75)) # I chose this 75/25 split arbitrarily...\n",
"print 'The training set size for this classifier is ' + str(trainset_size) + '\\n'\n",
"\n",
"X_train = np.array([''.join(el.values()) for el in x[0:trainset_size]])\n",
"y_train = np.array([''.join(el.keys()) for el in x[0:trainset_size]])\n",
"\n",
"X_test = np.array([''.join(el.values()) for el in x[trainset_size:len(amazon_reviews)]])\n",
"y_test = np.array([''.join(el.keys()) for el in x[trainset_size:len(amazon_reviews)]])\n",
"\n",
"vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1), stop_words='english', strip_accents='unicode', norm='l2')\n",
"\n",
"X_train = vectorizer.fit_transform(X_train)\n",
"X_test = vectorizer.transform(X_test)\n",
"\n",
"classifier = MultinomialNB().fit(X_train, y_train)\n",
"y_predicted = classifier.predict(X_test)\n",
"\n",
"print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_predicted))\n",
"print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_predicted))\n",
"print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_predicted))\n",
"\n",
"#not bad for a first pass, but the confusion matrix below shows the classifier leans heavily towards the largest class\n",
"#a simple next step would be to move from unigrams to bigrams; try varying ngram_range from (1, 1) to (1, 2) -- see the sketch in the next cell\n",
"#we could also modify the vectorizer to stem or lemmatize\n",
"print '\\nHere is the confusion matrix:'\n",
"print metrics.confusion_matrix(y_test, y_predicted)\n",
"\n",
"#What are the top N most predictive features per class?\n",
"N = 10\n",
"vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])\n",
"\n",
"#rows of classifier.coef_ follow classifier.classes_ (sorted label order), so iterate over that rather than target_labels\n",
"for i, label in enumerate(classifier.classes_):\n",
"    topN = np.argsort(classifier.coef_[i])[-N:]\n",
"    print \"\\nThe top %d most informative features for %s: \\n%s\" % (N, label, \" \".join(vocabulary[topN]))"
],
"language": "python",
"metadata": {},
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"The training set size for this classifier is 4142\n", | |
"\n", | |
"The precision for this classifier is 0.764007077933" | |
] | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"\n", | |
"The recall for this classifier is 0.800580130529\n", | |
"The f1 for this classifier is 0.722836562295\n", | |
"\n", | |
"Here is the confusion matrix:\n", | |
"[[ 18 169 0]\n", | |
" [ 0 1086 0]\n", | |
" [ 0 106 0]]\n", | |
"\n", | |
"The top 10 most informative features for biologicalsciences_2010: \n", | |
"information gavin women intuition violence gift becker read fear book\n", | |
"\n", | |
"The top 10 most informative features for literature_2010: \n", | |
"like reading time novel quot story great gatsby read book\n", | |
"\n", | |
"The top 10 most informative features for sociology_2010: \n", | |
"really turow stickers lerner relationships dance quot read book anger\n" | |
] | |
} | |
], | |
"prompt_number": 5 | |
}, | |
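{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a follow-up to the comment in the previous cell, here is a minimal sketch of the bigram variant: the only substantive change is `ngram_range=(1, 2)`. It reuses `x`, `trainset_size`, and `amazon_reviews` from the cells above; the `_bi` names are new and introduced only for this sketch, and no output is shown because the scores depend on the random shuffle."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#a sketch of the bigram variant suggested above; the only change is ngram_range=(1, 2)\n",
"import numpy as np\n",
"\n",
"train_docs = x[0:trainset_size]\n",
"test_docs = x[trainset_size:len(amazon_reviews)]\n",
"\n",
"bigram_vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 2), stop_words='english', strip_accents='unicode', norm='l2')\n",
"X_train_bi = bigram_vectorizer.fit_transform([''.join(el.values()) for el in train_docs])\n",
"X_test_bi = bigram_vectorizer.transform([''.join(el.values()) for el in test_docs])\n",
"y_train_bi = np.array([''.join(el.keys()) for el in train_docs])\n",
"y_test_bi = np.array([''.join(el.keys()) for el in test_docs])\n",
"\n",
"bigram_classifier = MultinomialNB().fit(X_train_bi, y_train_bi)\n",
"y_predicted_bi = bigram_classifier.predict(X_test_bi)\n",
"\n",
"print 'The f1 for the bigram classifier is ' + str(metrics.f1_score(y_test_bi, y_predicted_bi))"
],
"language": "python",
"metadata": {},
"outputs": []
},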
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# etcML demonstration!\n",
"\n",
"First, let's reformat the Amazon data into one of the two structures that etcML expects: a directory per label, with one text file per review, zipped into a single archive."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"import zipfile\n",
"\n",
"for review in amazon_reviews:\n",
"    label = ''.join(review.keys())\n",
"    text = ''.join(review.values())\n",
"\n",
"    etcMLdir = os.path.join(os.getcwd(), 'etcML', label)\n",
"\n",
"    if not os.path.exists(etcMLdir):\n",
"        try:\n",
"            os.makedirs(etcMLdir)\n",
"        except OSError:\n",
"            print \"Skipping creation of %s because it exists already.\" % etcMLdir\n",
"\n",
"    #it would probably be better to store each review's DOI in a dictionary and name the file after the DOI rather than the index number\n",
"    with open(os.path.join(etcMLdir, 'review_' + str(amazon_reviews.index(review)) + '.txt'), 'wb') as outfile:\n",
"        outfile.write(text)\n",
"\n",
"#note that it wasn't really necessary to write these files out to a directory first...\n",
"#we could have written a function that adds to a zipfile dynamically (see the sketch below)\n",
"\n",
"def zipdir(path, zipf):\n",
"    for root, dirs, files in os.walk(path):\n",
"        for file in files:\n",
"            zipf.write(os.path.join(root, file))\n",
"\n",
"zipf = zipfile.ZipFile('amazon.zip', 'w')\n",
"zipdir('etcML/', zipf)\n",
"zipf.close()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
}
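,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted in the comments above, the intermediate directory is not strictly necessary. Here is a minimal sketch, under the assumption that etcML only needs the label-per-directory layout *inside* the archive, that streams each review straight into a zip with `writestr` (the filename `amazon_streamed.zip` is just illustrative)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#a sketch: write each review straight into the archive, skipping the intermediate files\n",
"import zipfile\n",
"\n",
"archive = zipfile.ZipFile('amazon_streamed.zip', 'w')\n",
"\n",
"for i, review in enumerate(amazon_reviews):\n",
"    label = ''.join(review.keys())\n",
"    text = ''.join(review.values())\n",
"    archive.writestr(label + '/review_' + str(i) + '.txt', text)\n",
"\n",
"archive.close()"
],
"language": "python",
"metadata": {},
"outputs": []
}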
],
"metadata": {}
}
]
}