@bbengfort
Created October 24, 2014 15:40
Getting a normalized FreqDist
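import string

import nltk


def tokenize(text):
    """Yield lowercased tokens from text, skipping stopwords and punctuation."""
    stopwords = set(nltk.corpus.stopwords.words('english'))
    for token in nltk.word_tokenize(text):
        # Lowercase before filtering: the stopword list is all lowercase,
        # so "The" would otherwise slip through.
        token = token.lower()
        if token in stopwords or token in string.punctuation:
            continue
        yield token


def count(text):
    """Return an nltk.FreqDist of the normalized tokens in text."""
    return nltk.FreqDist(tokenize(text))


if __name__ == "__main__":
    # Requires NLTK's stopwords corpus and tokenizer data to be downloaded.
    text = "The cat in the hat sat on the cat mat, with aplumb."
    for token, freq in count(text).items():
        print("%s: %i" % (token, freq))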
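
If "normalized" is also meant to cover relative frequencies (each count divided by the total number of kept tokens), the same pipeline can be sketched without NLTK data. This is only an illustration under stated assumptions: `STOPWORDS` is a tiny hypothetical stand-in for the NLTK English stopword list, the regex is a naive substitute for `nltk.word_tokenize`, and `relative_freqs` is a made-up helper name. With NLTK itself, `FreqDist.freq(token)` returns the same ratio.

```python
import re
import string
from collections import Counter

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "in", "on", "with"}


def tokenize(text):
    """Yield lowercased tokens, skipping stopwords and punctuation."""
    # Naive regex tokenizer standing in for nltk.word_tokenize:
    # runs of word characters, or single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", text):
        token = token.lower()
        if token in STOPWORDS or token in string.punctuation:
            continue
        yield token


def relative_freqs(text):
    """Map each kept token to its share of all kept tokens."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if not total:
        return {}
    return {token: n / float(total) for token, n in counts.items()}


freqs = relative_freqs("The cat in the hat sat on the cat mat, with aplumb.")
# Print tokens from most to least frequent.
for token, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print("%s: %.3f" % (token, share))
```

The shares sum to 1, which makes distributions from texts of different lengths directly comparable.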