
import urllib2
import re
import sys
from collections import defaultdict
from random import random
"""
PLEASE DO NOT RUN THIS QUOTED CODE FOR THE SAKE OF daemonology's SERVER, IT IS
NOT MY SERVER AND I FEEL BAD FOR ABUSING IT. JUST GET THE RESULTS OF THE
CRAWL HERE: http://pastebin.com/raw.php?i=nqpsnTtW AND SAVE THEM TO "archive.txt"
"""

# coding=UTF-8
from __future__ import division
import nltk
from collections import Counter
# This is a simple tool for adding automatic hashtags into an article title
# Created by Shlomi Babluki
# Sep, 2013

Text Classification

To demonstrate text classification with Scikit Learn, we'll build a simple spam filter. While the filters in production for services like Gmail will obviously be vastly more sophisticated, the model we'll have by the end of this chapter is effective and surprisingly accurate.

Spam filtering is the "hello world" of document classification, but it's worth noting that we aren't limited to two classes. The classifier we'll use supports multi-class classification, which opens up possibilities like author identification, support email routing, etc… In this example, however, we'll stick to just two classes: SPAM and HAM.

For this exercise, we'll use a combination of the Enron-Spam data sets and the SpamAssassin public corpus. Both are publicly available for download and are retrieved from the internet during the setup phase of the example code that accompanies this chapter.
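To make the shape of the task concrete before we load the real corpora, here is a minimal sketch of the kind of pipeline this chapter builds: bag-of-words counts fed into a naive Bayes classifier. The tiny inline corpus is an invented stand-in for the Enron-Spam / SpamAssassin examples, not the chapter's actual data.

```python
# A minimal spam/ham sketch with scikit-learn: CountVectorizer produces
# word-count features, MultinomialNB classifies them. The four training
# texts below are hypothetical placeholders for the real corpora.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "win a free prize now",               # spam
    "limited offer claim your cash",      # spam
    "meeting rescheduled to monday",      # ham
    "please review the attached report",  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),  # bag-of-words counts
    ("classifier", MultinomialNB()),    # naive Bayes over those counts
])
pipeline.fit(train_texts, train_labels)

print(pipeline.predict(["free cash prize"])[0])
```

Even on four documents, words like "free" and "cash" carry enough signal for naive Bayes to separate the two classes; the real model trained later in the chapter follows the same fit/predict pattern on far more data.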

Loading Examples

@stephenLee
stephenLee / snippet.clj
Created August 23, 2013 07:50
Learn Clojure
; map
(map clojure.string/lower-case ["Java" "Imperative" "Weeping" "Clojure"])
; => ("java" "imperative" "weeping" "clojure")
(map * [1 2 3 4] [5 6 7 8])
; => (5 12 21 32)
; reduce
(reduce max [0 -3 10 48])
; => 48
(reduce + 50 [1 2 3 4])
; => 60
; partial
(def only-strings (partial filter string?))
; (only-strings ["a" 5 "b"]) => ("a" "b")
import spark.SparkContext
import SparkContext._
/**
* A port of [[http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/]]
* to Spark.
* Uses movie ratings data from MovieLens 100k dataset found at [[http://www.grouplens.org/node/73]]
*/
object MovieSimilarities {
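The Scala snippet above is only the opening of the port, but the core idea it implements (rating movies similar when users rate them similarly) can be illustrated in a few lines of NumPy. The ratings matrix below is invented for illustration; the real program reads the MovieLens 100k data.

```python
# Toy illustration of item-item similarity (not the Scala code above):
# treat each movie as its column of user ratings and compare columns
# with cosine similarity. The matrix is invented; 0 would mean "unrated".
import numpy as np

# rows = users, columns = movies
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine_similarity(ratings[:, 0], ratings[:, 1])  # co-rated alike
sim_02 = cosine_similarity(ratings[:, 0], ratings[:, 2])  # rated oppositely
```

Movies 0 and 1 get similar ratings from every user, so their similarity is close to 1, while movie 2 attracts the opposite audience and scores much lower; the MapReduce/Spark versions compute the same pairwise statistic at scale.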
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

def get_vectors(vocab_size=5000):
    newsgroups_train = fetch_20newsgroups(subset='train')
    vectorizer = CountVectorizer(max_df=.9, max_features=vocab_size)
    vecs = vectorizer.fit_transform(newsgroups_train.data)
    vocabulary = vectorizer.vocabulary_  # fitted vocab lives in vocabulary_, not vocabulary
    terms = np.array(list(vocabulary.keys()))
    return vecs, terms
# Dirichlet process Gaussian mixture model
import numpy as np
from scipy.special import gammaln
from scipy.linalg import cholesky
from sliceSample import sliceSample
def multinomialDraw(dist):
    """Returns a single draw from the given multinomial distribution."""
    return np.random.multinomial(1, dist).argmax()
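A quick sanity check of `multinomialDraw` (redefined here so the snippet is self-contained): it returns the index of the single category selected, so repeated draws should land on the high-probability index most often.

```python
# multinomialDraw returns the index of the one nonzero entry in a
# single multinomial trial, i.e. a categorical draw over the given
# probabilities. The distribution below is an arbitrary example.
import numpy as np

def multinomialDraw(dist):
    """Returns a single draw from the given multinomial distribution."""
    return np.random.multinomial(1, dist).argmax()

np.random.seed(0)
draws = [multinomialDraw([0.1, 0.2, 0.7]) for _ in range(1000)]
# every draw is an index in {0, 1, 2}; index 2 should dominate
```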
@stephenLee
stephenLee / argmin_max.tex
Created December 15, 2012 08:44
latex argmax and argmin
\DeclareMathOperator*{\argmax}{arg\,max} % in your preamble
\DeclareMathOperator*{\argmin}{arg\,min} % in your preamble
\argmax_{...} % in your formula
\argmin_{...} % in your formula
@stephenLee
stephenLee / index.html
Created December 9, 2012 14:06
Renren friends collections
<!DOCTYPE html>
<meta charset="utf-8">
<script src="http://d3js.org/d3.v2.min.js?2.9.3"></script>
<style>
.link {
stroke: #ccc;
}
.node text {
@stephenLee
stephenLee / virtual.md
Created December 5, 2012 15:44
virtualenvwrapper usages
  • source /usr/local/bin/virtualenvwrapper.sh
  • mkvirtualenv env1
  • ls $WORKON_HOME
  • lssitepackages
  • workon env2 (switch to another environment)
  • deactivate
  • rmvirtualenv env2