/// Fix a Hugging Face tokenizer to which tokens were added after training.
///
/// Adding tokens after training via `add_special_tokens` places them in the
/// `added_tokens` section but not in the `model.vocab` section. This yields warnings like:
/// ```text
/// [2023-10-17T07:54:05Z WARN tokenizers::tokenizer::serialization] Warning: Token '<|empty_usable_token_space_1023|>' was expected to have ID '129023' but was given ID 'None'
/// ```
/// The code in this file ensures that every token in `added_tokens` is also present in
/// `model.vocab`. This silences the warning without changing the tokenizer's behavior.