@language-engineering
Created October 4, 2012 17:17
from sussex_nltk.corpus_readers import ReutersCorpusReader
from sussex_nltk.stats import expected_token_freq

rcr = ReutersCorpusReader()
sample_size = 1000  # The number of sentences in a sample

# Randomly sample sample_size sentences and get a list of the tokens in those sentences
tokens = rcr.sample_words_by_sents(sample_size)

# Calculate and print the expected frequency of the token "elephant" for this one sample of tokens
# (the parenthesised print form runs under both Python 2 and Python 3)
print("Expected token frequency per 5000 tokens %s" % expected_token_freq(tokens, "elephant"))
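
The snippet above does not show how `expected_token_freq` arrives at its number. As a rough illustration only, one plausible reading of "expected token frequency per 5000 tokens" is the observed count of the word rescaled to a 5000-token window. The helper `expected_freq_per_n` and its `per` parameter below are hypothetical names made up for this sketch and are not part of `sussex_nltk`; the real function may compute something different.

```python
from collections import Counter


def expected_freq_per_n(tokens, word, per=5000):
    """Hypothetical helper (not the sussex_nltk implementation):
    rescale the observed count of `word` to a rate per `per` tokens."""
    tokens = list(tokens)
    if not tokens:
        # No tokens means nothing to count; avoid dividing by zero
        return 0.0
    return Counter(tokens)[word] * float(per) / len(tokens)


# A tiny hand-made token list standing in for a corpus sample
sample = ["the", "elephant", "saw", "the", "other",
          "elephant", "run", "away", "quickly", "today"]
rate = expected_freq_per_n(sample, "elephant")
print("Expected token frequency per 5000 tokens: %s" % rate)
```

Here "elephant" occurs 2 times in 10 tokens, so the rescaled rate is 2 * 5000 / 10 = 1000.0. On a real corpus sample, repeating the calculation over several independent samples and comparing the results gives a feel for how much the estimate varies.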