@mbednarski
Last active May 28, 2022 15:43
# Collect every distinct token, in order of first appearance.
vocabulary = []
for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)

# Lookup tables: token -> index and index -> token.
word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

vocabulary_size = len(vocabulary)
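
The snippet above expects `tokenized_corpus` to already exist. As a minimal sketch of how it behaves, the example below runs the same loop on a two-sentence toy corpus; the corpus contents and the name `vocabulary_alt` are assumptions made for illustration and are not part of the gist. It also shows one possible alternative, `dict.fromkeys`, which keeps first-seen order (Python 3.7+) and avoids rescanning the list for every token.

```python
# Toy corpus (hypothetical; stands in for the real tokenized_corpus).
tokenized_corpus = [
    ["he", "is", "a", "king"],
    ["she", "is", "a", "queen"],
]

# Same loop as the gist: collect each token once, in first-seen order.
vocabulary = []
for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}
vocabulary_size = len(vocabulary)

# Alternative (not from the gist): dict keys preserve insertion order,
# so this builds the same list without the repeated membership scan.
vocabulary_alt = list(dict.fromkeys(t for s in tokenized_corpus for t in s))

print(vocabulary)         # ['he', 'is', 'a', 'king', 'she', 'queen']
print(word2idx["queen"])  # 5
print(vocabulary_size)    # 6
```

For a corpus of this size the two approaches are interchangeable; the `dict.fromkeys` variant mostly matters when the vocabulary grows large, since `token not in vocabulary` scans the whole list each time.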
@unedited-despair

I did:

vocabulary = []
[[vocabulary.append(t) for t in s if t not in vocabulary] for s in tokenized_corpus]
