@gaphex
Created May 9, 2019 15:15
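import sentencepiece as spm

MODEL_PREFIX = "tokenizer" #@param {type: "string"}
VOC_SIZE = 32000 #@param {type:"integer"}
SUBSAMPLE_SIZE = 12800000 #@param {type:"integer"}
NUM_PLACEHOLDERS = 256 #@param {type:"integer"}

# PRC_DATA_FPATH is not defined in this snippet; it must be set beforehand
# (presumably the path to the preprocessed training text).
# Train on a shuffled subsample of the input, leave NUM_PLACEHOLDERS
# vocabulary slots unlearned, and disable the BOS/EOS tokens (id -1).
SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true '
               '--bos_id=-1 --eos_id=-1').format(
                   PRC_DATA_FPATH, MODEL_PREFIX,
                   VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

spm.SentencePieceTrainer.Train(SPM_COMMAND)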
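
The command above asks SentencePiece for VOC_SIZE - NUM_PLACEHOLDERS pieces rather than VOC_SIZE. The snippet does not say what the remaining slots are for. One plausible reading is that the learned pieces are later padded with placeholder tokens until the vocabulary reaches VOC_SIZE. The sketch below illustrates only that arithmetic. The helper `pad_vocab`, the `[UNUSED_n]` token format, and the dummy piece names are assumptions made for illustration and do not come from the gist.

```python
# Hypothetical illustration: pad_vocab, the "[UNUSED_n]" format, and the
# dummy "piece_n" names are assumptions, not part of the original snippet.
VOC_SIZE = 32000
NUM_PLACEHOLDERS = 256


def pad_vocab(learned_pieces, target_size, placeholder_fmt="[UNUSED_{}]"):
    """Append placeholder tokens until the vocabulary has target_size entries."""
    n_missing = target_size - len(learned_pieces)
    if n_missing < 0:
        raise ValueError("learned vocabulary exceeds the target size")
    placeholders = [placeholder_fmt.format(i) for i in range(n_missing)]
    return list(learned_pieces) + placeholders


# Stand-in for the pieces a trained tokenizer would provide.
learned = ["piece_{}".format(i) for i in range(VOC_SIZE - NUM_PLACEHOLDERS)]
full_vocab = pad_vocab(learned, VOC_SIZE)

print(len(learned), len(full_vocab))  # 31744 32000
```

With the values used in the snippet, the trainer is asked for 31744 pieces, which leaves 256 free entries in a 32000-entry vocabulary.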