@gaphex
Created May 9, 2019 15:54
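def parse_sentencepiece_token(token):
    # SentencePiece marks word-initial pieces with "▁"; BERT's WordPiece
    # instead marks word-continuation pieces with a "##" prefix.
    if token.startswith("▁"):
        return token[1:]
    else:
        return "##" + token

# snt_vocab (list of SentencePiece tokens) and VOC_SIZE (target vocabulary
# size) are expected to be defined before this point.
bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))

# Prepend BERT's control symbols, then pad with unused placeholders up to VOC_SIZE.
ctrl_symbols = ["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab
bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
print(len(bert_vocab))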
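
As a rough, self-contained illustration of the conversion above, the sketch below wraps the same steps in a function and runs them on a toy vocabulary. The helper name `build_bert_vocab`, the toy tokens, and the target size of 12 are assumptions made for this example and are not part of the original snippet.

```python
def parse_sentencepiece_token(token):
    # Word-initial SentencePiece pieces start with "▁"; everything else is
    # treated as a continuation piece and gets BERT's "##" prefix.
    if token.startswith("▁"):
        return token[1:]
    return "##" + token


def build_bert_vocab(snt_vocab, voc_size):
    # Hypothetical helper: same steps as the snippet, packaged as a function.
    ctrl_symbols = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    bert_vocab = ctrl_symbols + [parse_sentencepiece_token(t) for t in snt_vocab]
    # Pad with placeholder entries until the vocabulary reaches voc_size.
    bert_vocab += ["[UNUSED_{}]".format(i) for i in range(voc_size - len(bert_vocab))]
    return bert_vocab


# Toy input (assumed for illustration): two word-initial and two continuation pieces.
toy_vocab = ["▁the", "▁color", "less", "ful"]
vocab = build_bert_vocab(toy_vocab, voc_size=12)
print(vocab)
print(len(vocab))  # 12
```

With this toy input, the five control symbols come first, followed by `the`, `color`, `##less`, `##ful`, and then three `[UNUSED_i]` placeholders that bring the total to 12 entries.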