Convert tiktoken tokenizers to the Hugging Face tokenizers format

I had forgotten to update the gist with my new conversion script, which accounts for the new split pre-tokenization regex (thanks @gautierdag for pointing that out!).

It also sets clean_up_tokenization_spaces to False by default (thanks @binxuan for pointing that out).
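To illustrate why this default matters for round-trip compatibility, here is a rough sketch of the kind of post-decode cleanup that clean_up_tokenization_spaces=True enables. The replacement list is an assumption based on my understanding of the transformers behaviour and is not copied from the library; the function name is hypothetical. The point is only that any such cleanup rewrites the decoded text, so it can no longer match what tiktoken returns.

```python
def clean_up_spaces_sketch(text: str) -> str:
    """Approximate the whitespace cleanup applied after decoding.

    NOTE: the replacement pairs below are an assumption for illustration,
    not the library's authoritative list.
    """
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# Text that a tokenizer decoded faithfully, spaces included.
original = "Hello , world ! It 's fine ."
cleaned = clean_up_spaces_sketch(original)

# The cleanup changes the string, so decode(encode(x)) != x.
print(repr(original))
print(repr(cleaned))
```

With the cleanup disabled, the decoded string is left untouched, which is what a byte-exact comparison against tiktoken needs.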

So, now it's updated 🤗 👍 I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate:

import tqdm
from datasets import load_dataset
import tiktoken
from transformers import GPT2TokenizerFast

# Converted Hugging Face tokenizer and the original tiktoken reference.
hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
og_tokenizer = tiktoken.encoding_for_model('gpt-4')

# With 'all_languages', each premise is a dict mapping language code to text.
dataset = load_dataset('xnli', 'all_languages')

for item in tqdm.tqdm(dataset['train']):
    for string in item['premise'].values():
        # Encoding check: both tokenizers must produce identical token IDs.
        encoded1 = og_tokenizer.encode(string)
        encoded2 = hf_tokenizer.encode(string)

        assert encoded1 == encoded2, f'encoding "{string}" is incorrect. "{encoded1}" != "{encoded2}"'

        # Decoding check: both tokenizers must reproduce the same text.
        decoded1 = og_tokenizer.decode(encoded1)
        decoded2 = hf_tokenizer.decode(encoded2, skip_special_tokens=True)

        assert decoded1 == decoded2, f'decoding "{string}" is incorrect. "{decoded1}" != "{decoded2}"'
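If you would rather collect failures than stop at the first assertion, the same check can be written as a small helper. This is only a sketch: find_mismatches and the toy byte-level tokenizers below are hypothetical stand-ins so the snippet runs without downloading anything. In practice you would pass og_tokenizer.encode, hf_tokenizer.encode and the matching decode functions instead.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    strings: Iterable[str],
    encode_a: Callable[[str], List[int]],
    encode_b: Callable[[str], List[int]],
    decode_a: Callable[[List[int]], str],
    decode_b: Callable[[List[int]], str],
    limit: int = 10,
) -> List[Tuple[str, str]]:
    """Return up to `limit` (string, stage) pairs where the two tokenizers disagree."""
    mismatches: List[Tuple[str, str]] = []
    for s in strings:
        ids_a, ids_b = encode_a(s), encode_b(s)
        if ids_a != ids_b:
            mismatches.append((s, "encode"))
        elif decode_a(ids_a) != decode_b(ids_b):
            mismatches.append((s, "decode"))
        if len(mismatches) >= limit:
            break
    return mismatches


# Toy stand-ins: a byte-level "reference" and a deliberately buggy copy
# that lowercases its input before encoding.
def ref_encode(s: str) -> List[int]:
    return list(s.encode("utf-8"))


def buggy_encode(s: str) -> List[int]:
    return list(s.lower().encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = ["abc", "Hello", "héllo"]
result = find_mismatches(samples, ref_encode, buggy_encode, byte_decode, byte_decode)
print(result)
```

Only the sample containing an uppercase letter is reported, since that is the only input where the two toy encoders differ.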

xenova/tiktoken-to-hf.ipynb

xenova commented Mar 27, 2024 (edited)

david-waterworth commented Apr 30, 2024
