@ixtel
Forked from bbengfort/tokens.py
Created October 19, 2015 12:58
Getting a normalized FreqDist
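import string

import nltk

# Needs the NLTK 'stopwords' corpus and Punkt tokenizer data (see nltk.download).


def tokenize(text):
    """Yield lowercased tokens, skipping stopwords and punctuation."""
    stopwords = set(nltk.corpus.stopwords.words('english'))
    for token in nltk.word_tokenize(text):
        token = token.lower()  # lowercase first so "The" matches the stopword "the"
        if token in stopwords or token in string.punctuation:
            continue
        yield token


def count(text):
    """Return a FreqDist of the normalized tokens in text."""
    return nltk.FreqDist(tokenize(text))


if __name__ == "__main__":
    text = "The cat in the hat sat on the cat mat, with aplomb."
    # items() yields (token, frequency) pairs; values() would yield only the counts.
    for token, freq in count(text).items():
        print("%s: %i" % (token, freq))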
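
For checking the expected counts without installing NLTK or its data, the same normalization steps (lowercase, drop stopwords, drop punctuation, count) can be sketched with the standard library alone. This is a rough stand-in and not the gist's method: `collections.Counter` replaces `nltk.FreqDist`, a `\w+` regex replaces `nltk.word_tokenize`, and `SAMPLE_STOPWORDS`, `simple_tokenize`, `simple_count` and `relative_frequencies` are hypothetical names introduced only for this illustration. The stopword set is deliberately tiny, so results will differ from NLTK's on other inputs.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
SAMPLE_STOPWORDS = {"the", "in", "on", "with"}


def simple_tokenize(text, stopwords=SAMPLE_STOPWORDS):
    """Yield lowercased word tokens, skipping stopwords."""
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
    for token in re.findall(r"\w+", text.lower()):
        if token not in stopwords:
            yield token


def simple_count(text):
    """Return a Counter mapping each normalized token to its frequency."""
    return Counter(simple_tokenize(text))


def relative_frequencies(counts):
    """Scale raw counts so that they sum to 1.0 (empty input gives an empty dict)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


if __name__ == "__main__":
    counts = simple_count("The cat in the hat sat on the cat mat, with aplomb.")
    print(counts.most_common(1))  # [('cat', 2)]
    print(relative_frequencies(counts))
```

Dividing by the total is one reading of "normalized"; with NLTK itself, `FreqDist.freq(sample)` returns the same kind of relative frequency for a single sample.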