- learning rate
- dropout
- max tokens per batch
- clip norm
- tokenization method
- update frequency (gradient accumulation: mini-batches with delayed updates)
- optimizer
- learning rate scheduler
- warmup updates
- warmup init learning rate
- min learning rate
- label smoothing
- reduced precision (fp16)
- share all embeddings
- criterion (cross entropy)
- beam search size
- max length (ax + b, where x is the original length)
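
Several of these settings interact, and the arithmetic is easier to see in code. The sketch below is a minimal illustration, not any particular toolkit's implementation. It assumes an inverse square root scheduler with linear warmup (the list does not say which scheduler is used), treats the min learning rate as a floor (some toolkits instead stop training once the rate drops below it), and uses one common form of label smoothing that spreads the smoothing mass uniformly over all classes. All function names and default values are hypothetical.

```python
import math


def lr_at_update(step, peak_lr=5e-4, warmup_updates=4000,
                 warmup_init_lr=1e-7, min_lr=1e-9):
    """Learning rate at a given update (assumed inverse-sqrt schedule)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        lr = warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    else:
        # Decay proportionally to 1 / sqrt(step) after warmup.
        lr = peak_lr * math.sqrt(warmup_updates / step)
    # Assumption: min_lr acts as a lower bound on the rate.
    return max(lr, min_lr)


def tokens_per_update(max_tokens, update_freq, num_workers=1):
    """Upper bound on tokens contributing to one optimizer step.

    Gradients from update_freq mini-batches are accumulated on each
    worker before the (delayed) parameter update.
    """
    return max_tokens * update_freq * num_workers


def label_smoothed_nll(log_probs, target, eps=0.1):
    """Cross entropy with label smoothing for a single prediction.

    log_probs: list of log-probabilities over the vocabulary.
    target:    index of the correct class.
    """
    nll = -log_probs[target]
    # Average negative log-probability over all classes (uniform prior).
    smooth = -sum(log_probs) / len(log_probs)
    return (1.0 - eps) * nll + eps * smooth


def max_output_len(original_len, a=1.0, b=50):
    """Decoding length limit ax + b, where x is the original length."""
    return int(a * original_len + b)
```

For example, with these hypothetical defaults `lr_at_update(16000)` is half the peak rate, because the decay factor is `sqrt(4000 / 16000) = 0.5`, and `tokens_per_update(4000, 8)` shows that a single worker accumulating 8 mini-batches sees up to 32000 tokens per update.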