Joseph Misiti josephmisiti

Text Classification

To demonstrate text classification with scikit-learn, we'll build a simple spam filter. The filters used in production by services like Gmail are far more sophisticated, but the model we'll have by the end of this chapter is effective and surprisingly accurate.

Spam filtering is the "hello world" of document classification, but we aren't limited to two classes. The classifier we will be using supports multi-class classification, which opens up applications such as author identification and support email routing. In this example, however, we'll stick to two classes: SPAM and HAM.

For this exercise, we'll be using a combination of the Enron-Spam data sets and the SpamAssassin public corpus. Both are publicly available for download and are retrieved from the internet during the setup phase of the example code that goes with this chapter.
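Before loading the real corpora, the two-class setup can be sketched on a handful of toy messages. This is a minimal sketch, not the chapter's own code: the text above does not name the classifier, so the choice of `MultinomialNB` with a `CountVectorizer` is an assumption, and the training messages are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy messages invented for illustration; the real example uses the
# Enron-Spam and SpamAssassin corpora instead.
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Assumed model choice: bag-of-words counts fed into multinomial naive Bayes.
spam_filter = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
spam_filter.fit(train_texts, train_labels)

# Classify two unseen messages.
predictions = spam_filter.predict([
    "claim your free prize",
    "agenda for the team meeting",
])
print(list(predictions))
```

Because the labels are plain strings, the same pipeline would accept more than two classes without changes; only `train_labels` would need additional values.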

Loading Examples

// Override Underscore.js' template format
_.templateSettings = {
  evaluate: /\{%([\s\S]+?)%\}/g,
  interpolate: /\{\{(.+?)\}\}/g
};
from gensim import corpora, models, similarities, utils
import logging
import os
import re
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
class DirectoryCorpus(corpora.TextCorpus):
    def get_texts(self):

Dustin Moskovitz is a co-founder of Asana and was a co-founder of Facebook. Here are some of his favorite books.

Many thanks to Dustin for this list.

On Consciousness

Douglas R. Hofstadter

Gödel, Escher, Bach: An Eternal Golden Braid

sudo apt-get -y update
sudo apt-get -y upgrade
sudo apt-get -y dist-upgrade
sudo apt-get -y install git make python-dev python-setuptools libblas-dev gfortran g++ python-pip python-numpy python-scipy liblapack-dev
sudo pip install ipython nose
sudo apt-get install screen
sudo pip install --upgrade git+git://github.com/Theano/Theano.git
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1204/x86_64/cuda-repo-ubuntu1204_5.5-0_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1204_5.5-0_amd64.deb
sudo apt-get update
@josephmisiti
josephmisiti / dnn.py
Created August 10, 2014 21:48 — forked from syhw/dnn.py
"""
A deep neural network with or w/o dropout in one file.
"""
import numpy, theano, sys, math
from theano import tensor as T
from theano import shared
from theano.tensor.shared_randomstreams import RandomStreams
from collections import OrderedDict
# Author: Pieter Noordhuis
# Description: Simple demo to showcase Redis PubSub with EventMachine
#
# Update 7 Oct 2010:
# - This example does *not* appear to work with Chrome >=6.0. Apparently,
# the WebSocket protocol implementation in the cramp gem does not work
# well with Chrome's (newer) WebSocket implementation.
#
# Requirements:
# - rubygems: eventmachine, thin, cramp, sinatra, yajl-ruby
josephmisiti / get.sh
Last active August 29, 2015 14:07 — forked from imjasonh/get.sh
#!/bin/bash
set -e
set -x
LIMIT=10000
for h in `seq 0 23`
do
  for d in `seq 1 7`
josephmisiti / introrx.md
Last active August 29, 2015 14:08 — forked from staltz/introrx.md

The introduction to Reactive Programming you've been missing

(by @andrestaltz)

So you're curious about learning this new thing called (Functional) Reactive Programming (FRP).

Learning it is hard, and the lack of good material makes it even harder. When I started, I tried looking for tutorials. I found only a handful of practical guides, but they just scratched the surface and never tackled the challenge of building the whole architecture around it. Library documentation often doesn't help when you're trying to understand some function. I mean, honestly, look at this:

Rx.Observable.prototype.flatMapLatest(selector, [thisArg])

Projects each element of an observable sequence into a new sequence of observable sequences by incorporating the element's index and then transforms an observable sequence of observable sequences into an observable sequence producing values only from the most recent observable sequence.
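The "most recent sequence only" behaviour in that description can be modelled on a plain timeline. This is a rough sketch in Python, not the Rx API: the function `flat_map_latest`, the `(time, value)` event tuples, and the `(delay, item)` inner emissions are all hypothetical names and shapes chosen for illustration.

```python
def flat_map_latest(outer, selector):
    """Model of 'switch to the latest inner sequence' on a timeline.

    outer:    list of (time, value) events from the source sequence.
    selector: maps a value to a list of (delay, item) inner emissions.
    Returns the (time, item) pairs that would actually be emitted.
    """
    outer = sorted(outer, key=lambda event: event[0])
    emitted = []
    for index, (time, value) in enumerate(outer):
        # The inner sequence is dropped as soon as the next outer event fires.
        if index + 1 < len(outer):
            cutoff = outer[index + 1][0]
        else:
            cutoff = float("inf")
        for delay, item in selector(value):
            emit_at = time + delay
            if emit_at < cutoff:
                emitted.append((emit_at, item))
    return sorted(emitted, key=lambda event: event[0])


# Each outer value spawns three inner items, 1, 3, and 7 ticks later.
def spawn(value):
    return [(1, value + "1"), (3, value + "2"), (7, value + "3")]


result = flat_map_latest([(0, "a"), (5, "b")], spawn)
print(result)
```

In this model the item `"a3"` never appears: it was scheduled for tick 7, but the outer event `"b"` arrived at tick 5 and replaced the inner sequence started by `"a"`.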

josephmisiti / README.md
Last active August 29, 2015 14:09 — forked from mbostock/.block

Source: American Community Survey, 2011 5-Year Estimate

This map was inspired by a similar map found on Wikipedia. I wasn’t wild about the diverging color scale, so I thought it would be a fun challenge to recreate.

Finding the shapefiles was easy; I used the U.S. National Atlas 1:1,000,000 scale dataset, conveniently packaged in my U.S. Atlas repository. I reprojected the shapefiles to the California Albers projection using ogr2ogr:

ogr2ogr \
	-f 'ESRI Shapefile' \
	-t_srs 'EPSG:3310' \