Last active
April 7, 2025 20:13
Convert tiktoken tokenizers to the Hugging Face tokenizers format
Shouldn't 'tokenizer_class' be 'GPT2Tokenizer' in all cases? This is the concrete Hugging Face class that gets instantiated, i.e. by doing this you can use
hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')
rather than GPT2TokenizerFast (which then generates a warning).
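For illustration, here is a minimal sketch of the change being suggested, assuming the class name lives under the 'tokenizer_class' key of the exported tokenizer_config.json. The helper name `set_tokenizer_class` and the sample config values are hypothetical and are not taken from the gist's conversion script.

```python
import json


def set_tokenizer_class(config: dict, tokenizer_class: str = "GPT2Tokenizer") -> dict:
    """Return a copy of a tokenizer_config mapping with 'tokenizer_class' replaced.

    Hypothetical helper: it only edits the dictionary and does not load a tokenizer.
    """
    patched = dict(config)  # shallow copy so the caller's mapping is left untouched
    patched["tokenizer_class"] = tokenizer_class
    return patched


# Illustrative config contents (assumed values, not copied from the gist).
example_config = {
    "tokenizer_class": "GPT2TokenizerFast",
    "model_max_length": 8192,
}

patched_config = set_tokenizer_class(example_config)
print(json.dumps(patched_config, indent=2))
```

Writing the patched mapping back to tokenizer_config.json and then loading it through AutoTokenizer is the step this comment describes; whether the warning goes away depends on the installed transformers version, which this sketch does not check.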
I actually forgot to update the gist with my new conversion script, which takes into account the new split pretokenization regex (thanks @gautierdag for pointing that out!). It also sets the default clean_up_tokenization_spaces to False (thanks @binxuan for pointing that out).
So, now it's updated 🤗 👍 I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate: