@gaphex
Created May 9, 2019 15:46
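import random


def read_sentencepiece_vocab(filepath):
    """Return the tokens listed in a SentencePiece .vocab file, minus the leading <unk>."""
    voc = []
    with open(filepath, encoding='utf-8') as fi:
        for line in fi:
            # each line is "<token>\t<score>"; keep only the token
            voc.append(line.split("\t")[0])
    # skip the first <unk> token
    voc = voc[1:]
    return voc


# MODEL_PREFIX is not defined in this snippet; it is assumed to be the prefix used when training the SentencePiece model.
snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
print("Learnt vocab size: {}".format(len(snt_vocab)))
print("Sample tokens: {}".format(random.sample(snt_vocab, 10)))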