@negedng
Created October 18, 2020 21:49
Tokenizer from the frequency list
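from tokenizers import BertWordPieceTokenizer

# Formatting the vocab dictionary from the most common words.
# Word ids start at 5 because ids 0-4 are reserved for the five special tokens below;
# starting at 4 would give the most common word the same id as [MASK].
vocab_dict = {word: i + 5 for i, (word, _) in enumerate(vocabulary_counter.most_common(20000 - 5))}
# Adding the special tokens
vocab_dict["[PAD]"] = 0
vocab_dict["[UNK]"] = 1
vocab_dict["[CLS]"] = 2
vocab_dict["[SEP]"] = 3
vocab_dict["[MASK]"] = 4
tokenizer_2 = BertWordPieceTokenizer(vocab_dict)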
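The snippet does not show where vocabulary_counter comes from. A minimal sketch of the id assignment on a toy corpus, assuming vocabulary_counter is a collections.Counter over lowercased, whitespace-separated words (the corpus, the split rule, and the small vocab_size are illustrative assumptions, not part of the original gist), might look like this. It reserves the first ids for the special tokens so that no word shares an id with one of them.

```python
from collections import Counter

# Hypothetical toy corpus; the original gist does not show its data.
corpus = ["the cat sat on the mat", "the dog sat"]

# Assumption: vocabulary_counter counts lowercased, whitespace-separated words.
vocabulary_counter = Counter(
    word for line in corpus for word in line.lower().split()
)

special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
vocab_size = 8  # small stand-in for the 20000 used in the gist

# Special tokens take ids 0..4; words follow from id 5 upward.
vocab_dict = {token: i for i, token in enumerate(special_tokens)}
offset = len(special_tokens)
for i, (word, _) in enumerate(vocabulary_counter.most_common(vocab_size - offset)):
    vocab_dict[word] = i + offset

# Sanity check: ids are unique and contiguous, so nothing collides.
assert sorted(vocab_dict.values()) == list(range(len(vocab_dict)))
```

A dictionary built this way has the same shape as vocab_dict in the snippet, so it could be passed to BertWordPieceTokenizer in the same manner.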