Julius Kreuzer jneuff

jneuff / fix-tokenizer.rs

Created October 17, 2023 11:35

Fix a huggingface tokenizer to which tokens have been added after training

	/// Fix a Hugging Face tokenizer to which tokens have been added after training.
	///
	/// Tokens added after training via `add_special_tokens` are recorded in the `added_tokens`
	/// section but not in the `model.vocab` section. Loading such a tokenizer yields warnings like:
	/// ```text
	/// [2023-10-17T07:54:05Z WARN tokenizers::tokenizer::serialization] Warning: Token '<|empty_usable_token_space_1023|>' was expected to have ID '129023' but was given ID 'None'
	/// ```
	/// The code in this file ensures that all tokens from `added_tokens` are also placed into
	/// `model.vocab`. This fixes the warning and does not change the tokenizer's behavior.