@lxneng
Created March 13, 2010 12:48
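import re

def getwords(doc):
    # Use \W+ rather than \W*: a pattern that can match the empty string
    # makes re.split break between every character on Python 3.7+
    splitter = re.compile(r'\W+')
    # Split the text on runs of non-word characters
    words = [s.lower() for s in splitter.split(doc)
             if 2 < len(s) < 20]
    # Return the unique set of words only
    return dict([(w, 1) for w in words])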
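
A minimal usage sketch, assuming Python 3. The name `getwords_set` and the sample sentence are hypothetical and not part of the gist. The variant applies the same length filter but returns a plain set, which carries the same information as a dict whose values are all 1.

```python
import re

def getwords_set(doc):
    # Hypothetical variant of getwords: same filtering, set instead of dict.
    # \W+ splits on runs of non-word characters (assumed splitter pattern).
    splitter = re.compile(r'\W+')
    return {s.lower() for s in splitter.split(doc) if 2 < len(s) < 20}

sample = "The quick brown fox; the QUICK brown dog, at 9 am."
words = getwords_set(sample)
print(sorted(words))  # ['brown', 'dog', 'fox', 'quick', 'the']
```

Tokens of one or two characters ("at", "9", "am") are dropped by the length filter, and case differences ("The", "the", "QUICK") collapse after lowercasing.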