@StrikingLoo
Created October 25, 2019 04:04
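# Split the corpus on single spaces and drop the empty strings that
# consecutive spaces leave behind.
corpus_words = [word for word in corpus.split(' ') if word != '']
corpus_words # [...'a', 'wyvern', ',', 'two', 'of', 'the', 'thousand'...]
len(corpus_words) # 2185920
# Vocabulary: every distinct token, plus a word -> index lookup.
# Note: set order is arbitrary, so the indices can differ between runs.
distinct_words = list(set(corpus_words))
word_idx_dict = {word: i for i, word in enumerate(distinct_words)}
distinct_words_count = len(distinct_words)  # reuse the list instead of rebuilding the set
distinct_words_count # 32663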
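
The snippet above depends on a `corpus` string defined elsewhere. As a rough, self-contained sketch of the same steps, the example below uses a tiny made-up corpus and then round-trips the tokens through the index lookup. The toy text, the `encoded` and `decoded` names, and the use of `sorted` (to make indices reproducible) are assumptions for illustration and are not part of the original gist.

```python
# Toy corpus (hypothetical); note the double space before the second "a".
corpus = "a wyvern , two of the thousand  a wyvern"

# Tokenize on single spaces and drop empty strings from repeated spaces.
corpus_words = [word for word in corpus.split(' ') if word != '']

# Build the vocabulary. sorted() is an assumption here: it gives stable
# indices across runs, unlike the arbitrary order of list(set(...)).
distinct_words = sorted(set(corpus_words))
word_idx_dict = {word: i for i, word in enumerate(distinct_words)}

# Encode each token as its vocabulary index, then decode back to words.
encoded = [word_idx_dict[word] for word in corpus_words]
decoded = [distinct_words[i] for i in encoded]

print(len(corpus_words), len(distinct_words))
print(encoded)
```

Because `distinct_words[i]` is the inverse of `word_idx_dict[word]`, `decoded` should match `corpus_words` exactly, which is a quick sanity check worth running on a real corpus as well.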