@marcelcaraciolo
Created April 19, 2013 20:49
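import re
import math

def getwords(doc):
    # Split the text on runs of non-word characters (\W+ rather than \W*,
    # which would also split on empty matches in Python 3.7+)
    splitter = re.compile(r'\W+')
    # Keep lowercased words of 3 to 19 characters
    words = [s.lower() for s in splitter.split(doc) if 2 < len(s) < 20]
    # Return the unique set of words only
    return dict([(w, 1) for w in words])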
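
A quick usage sketch, assuming Python 3 and a splitter that breaks on runs of non-word characters (`\W+`). The sample sentence is made up for illustration and is not part of the gist; `getwords` is restated so the snippet runs on its own.

```python
import re

def getwords(doc):
    # Restated so this sketch is self-contained; splitting on \W+ is an
    # assumption that keeps whole words intact on Python 3.7+
    splitter = re.compile(r'\W+')
    words = [s.lower() for s in splitter.split(doc) if 2 < len(s) < 20]
    # Each distinct word maps to 1, so repeated words collapse to one key
    return dict([(w, 1) for w in words])

# Hypothetical sample text, not from the original gist
sample = "The quick brown fox jumps over the lazy dog. The DOG barks!"
features = getwords(sample)

# "The", "the" and "DOG", "dog" collapse after lowercasing
print(sorted(features))
```

The dictionary values are always 1, so the result behaves like a set of features: only the presence of a word matters, not how often it occurs.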