@inspirit941
Created January 18, 2020 08:35
# data/data_preprocess/tokenization.py line 85
def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        # Changed so that an item missing from the vocab maps to the
        # unknown token [UNK] instead of raising an exception.
        if item in vocab:
            output.append(vocab[item])
        else:
            # Look up [UNK] only on a miss, and do not write the missing
            # item back into vocab, so the vocabulary is never mutated.
            output.append(vocab["[UNK]"])
    return output
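
The fallback behavior can be checked with a small, self-contained sketch. The function below repeats the same idea (look the item up, fall back to the `[UNK]` entry on a miss) so the example runs on its own. `toy_vocab` and the sample tokens are hypothetical values made up for illustration; they are not taken from the original project.

```python
def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab, with an [UNK] fallback."""
    output = []
    for item in items:
        if item in vocab:
            output.append(vocab[item])
        else:
            # Out-of-vocabulary item: use the entry stored under "[UNK]".
            output.append(vocab["[UNK]"])
    return output


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# "there" is not in toy_vocab, so it maps to the [UNK] id.
ids = convert_by_vocab(toy_vocab, ["hello", "there", "world"])
print(ids)  # [1, 0, 2]

# The vocabulary is left untouched by the lookup.
print("there" in toy_vocab)  # False
```

Note that the fallback assumes the mapping has a `"[UNK]"` key. A mapping without one (for example an id-to-token mapping keyed by integers) would still raise `KeyError` on a miss.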