
import urllib2
import re
import sys
from collections import defaultdict
from random import random
"""
PLEASE DO NOT RUN THIS QUOTED CODE FOR THE SAKE OF daemonology's SERVER, IT IS
NOT MY SERVER AND I FEEL BAD FOR ABUSING IT. JUST GET THE RESULTS OF THE
CRAWL HERE: http://pastebin.com/raw.php?i=nqpsnTtW AND SAVE THEM TO "archive.txt"
"""

# coding=UTF-8
from __future__ import division
import nltk
from collections import Counter
# This is a simple tool for adding automatic hashtags into an article title
# Created by Shlomi Babluki
# Sep, 2013

Text Classification

To demonstrate text classification with Scikit Learn, we'll build a simple spam filter. While the filters in production for services like Gmail will obviously be vastly more sophisticated, the model we'll have by the end of this chapter is effective and surprisingly accurate.

Spam filtering is the "hello world" of document classification, but it's worth noting that we aren't limited to two classes. The classifier we'll use supports multi-class classification, which opens up possibilities like author identification, support email routing, etc… In this example, however, we'll stick to just two classes: SPAM and HAM.

For this exercise, we'll use a combination of the Enron-Spam data sets and the SpamAssassin public corpus. Both are publicly available for download and are retrieved from the internet during the setup phase of the example code that accompanies this chapter.
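To make the shape of the task concrete before we load the real corpora, here is a minimal sketch of the kind of pipeline this chapter builds: bag-of-words counts fed into a naive Bayes classifier. The tiny inline corpus is an invented stand-in for the Enron-Spam / SpamAssassin examples, not the chapter's actual data.

```python
# A minimal spam/ham sketch with scikit-learn: CountVectorizer produces
# word-count features, MultinomialNB classifies them. The four training
# texts below are hypothetical placeholders for the real corpora.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "win a free prize now",               # spam
    "limited offer claim your cash",      # spam
    "meeting rescheduled to monday",      # ham
    "please review the attached report",  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),  # bag-of-words counts
    ("classifier", MultinomialNB()),    # naive Bayes over those counts
])
pipeline.fit(train_texts, train_labels)

print(pipeline.predict(["free cash prize"])[0])
```

Even on four documents, words like "free" and "cash" carry enough signal for naive Bayes to separate the two classes; the real model trained later in the chapter follows the same fit/predict pattern on far more data.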

Loading Examples

@stephenLee
stephenLee / snippet.clj
Created August 23, 2013 07:50
Learn Clojure
; map
(map clojure.string/lower-case ["Java" "Imperative" "Weeping" "Clojure"])
; => ("java" "imperative" "weeping" "clojure")
(map * [1 2 3 4] [5 6 7 8])
; => (5 12 21 32)
; reduce
(reduce max [0 -3 10 48])
; => 48
(reduce + 50 [1 2 3 4])
; => 60
; partial
(def only-strings (partial filter string?))
; (only-strings ["a" 5 "b"]) => ("a" "b")
import spark.SparkContext
import SparkContext._
/**
* A port of [[http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/]]
* to Spark.
* Uses movie ratings data from MovieLens 100k dataset found at [[http://www.grouplens.org/node/73]]
*/
object MovieSimilarities {
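The Scala snippet above is only the opening of the port, but the core idea it implements (rating movies similar when users rate them similarly) can be illustrated in a few lines of NumPy. The ratings matrix below is invented for illustration; the real program reads the MovieLens 100k data.

```python
# Toy illustration of item-item similarity (not the Scala code above):
# treat each movie as its column of user ratings and compare columns
# with cosine similarity. The matrix is invented; 0 would mean "unrated".
import numpy as np

# rows = users, columns = movies
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine_similarity(ratings[:, 0], ratings[:, 1])  # co-rated alike
sim_02 = cosine_similarity(ratings[:, 0], ratings[:, 2])  # rated oppositely
```

Movies 0 and 1 get similar ratings from every user, so their similarity is close to 1, while movie 2 attracts the opposite audience and scores much lower; the MapReduce/Spark versions compute the same pairwise statistic at scale.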
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

def get_vectors(vocab_size=5000):
    newsgroups_train = fetch_20newsgroups(subset='train')
    vectorizer = CountVectorizer(max_df=.9, max_features=vocab_size)
    vecs = vectorizer.fit_transform(newsgroups_train.data)
    vocabulary = vectorizer.vocabulary_  # fitted vocab lives in vocabulary_, not vocabulary
    terms = np.array(list(vocabulary.keys()))
    return vecs, terms
# Dirichlet process Gaussian mixture model
import numpy as np
from scipy.special import gammaln
from scipy.linalg import cholesky
from sliceSample import sliceSample
def multinomialDraw(dist):
    """Returns a single draw from the given multinomial distribution."""
    return np.random.multinomial(1, dist).argmax()
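A quick sanity check of `multinomialDraw` (redefined here so the snippet is self-contained): it returns the index of the single category selected, so repeated draws should land on the high-probability index most often.

```python
# multinomialDraw returns the index of the one nonzero entry in a
# single multinomial trial, i.e. a categorical draw over the given
# probabilities. The distribution below is an arbitrary example.
import numpy as np

def multinomialDraw(dist):
    """Returns a single draw from the given multinomial distribution."""
    return np.random.multinomial(1, dist).argmax()

np.random.seed(0)
draws = [multinomialDraw([0.1, 0.2, 0.7]) for _ in range(1000)]
# every draw is an index in {0, 1, 2}; index 2 should dominate
```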
@stephenLee
stephenLee / argmin_max.tex
Created December 15, 2012 08:44
latex argmax and argmin
\DeclareMathOperator*{\argmax}{arg\,max} % in your preamble
\DeclareMathOperator*{\argmin}{arg\,min} % in your preamble
\argmax_{...} % in your formula
\argmin_{...} % in your formula
@stephenLee
stephenLee / index.html
Created December 9, 2012 14:06
Renren friends collections
<!DOCTYPE html>
<meta charset="utf-8">
<script src="http://d3js.org/d3.v2.min.js?2.9.3"></script>
<style>
.link {
stroke: #ccc;
}
.node text {
@stephenLee
stephenLee / virtual.md
Created December 5, 2012 15:44
virtualenvwrapper usages
  • source /usr/local/bin/virtualenvwrapper.sh
  • mkvirtualenv env1
  • ls $WORKON_HOME
  • lssitepackages
  • workon env2 (switch to another environment)
  • deactivate
  • rmvirtualenv env2