Language models

Tokenizer

karpathy: https://www.youtube.com/watch?v=zduSFxRajkE

References:

GPT2 tokenizer (tiktokenizer)

Many unexpected behaviors and "oddness" in LLMs can be traced back to tokenization.

  • Even seemingly simple LLM limitations, like difficulty with spelling tasks, often stem from tokenization. Words might be broken down into tokens in ways that don't align with character-level understanding.
  • Simple string processing can be difficult for LLMs due to tokenization. The way text is chunked into tokens might not intuitively match how humans perform basic string manipulations.
  • Non-English languages can perform worse in LLMs, partly due to tokenization. Often, the same sentence in a non-English language will be broken down into a significantly larger number of tokens compared to English. This "bloats up the sequence length" and can lead to the model running out of context within its maximum window size. The tokenizer training data often has a strong bias towards English.
  • LLMs can struggle with simple arithmetic due to how numbers are tokenized. Numbers might be split into tokens in arbitrary ways (e.g., "677" as two tokens, "127" as one), making it harder for the model to perform digit-by-digit calculations.
  • GPT-2 specifically had more difficulties with Python code compared to later models. This is partly because the GPT-2 tokenizer inefficiently handles whitespace, treating each space as a separate token. Since Python heavily relies on indentation using spaces, this leads to a very long sequence of tokens for even short code snippets, quickly exceeding the context length.
    • The GPT-4 tokenizer has roughly twice as many tokens in its vocabulary and performed much better on Python, because a series of indentation spaces is now a single token rather than four or five separate " " tokens, which used to bloat the sequence and cause the model to run out of context.
  • Weird warnings about trailing whitespace in prompts often arise due to how tokenizers, like GPT-2's, handle spaces. A trailing space might be treated as a separate token, which can deviate from how the model was trained on sequences where spaces are typically part of the following word token.
  • Early versions of GPT models exhibited bizarre behavior when presented with specific phrases like "solid gold Magikarp." This is likely due to these phrases being rare in the language model's training data but potentially being a single, specific token in the tokenizer's vocabulary (e.g., if it was a frequent username in the tokenizer's training data like Reddit). When the model encounters this untrained token, it can lead to unpredictable and nonsensical outputs.
  • The choice of data format can impact token efficiency. For example, YAML can represent the same structured data in fewer tokens than JSON when using GPT-style tokenizers.
  • The inconsistency in how the same concept ("egg") is tokenized depending on its context (beginning of a sentence, preceded by a space, capitalization) highlights the complexities the language model must learn to handle. Tokens are case-sensitive, so "egg" and "Egg" are different tokens (see the sketch below).
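A quick way to see these effects is to run a few strings through the GPT-2 tokenizer. A minimal sketch, assuming the tiktoken package is installed; the exact token IDs don't matter, only that the sequences differ:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE tokenizer

for text in ["egg", " egg", "Egg", "127", "677"]:
    ids = enc.encode(text)
    print(f"{text!r:8} -> {ids} ({len(ids)} token(s))")

# "egg", " egg" and "Egg" map to different token sequences, and numbers
# such as "677" may split into multiple tokens while "127" stays whole.
```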

Effect of the tokenizer on attention

The Concept of Attention in Transformers: At its core, the attention mechanism allows a language model to weigh the importance of different parts of the input sequence when processing information to predict the next token. In the Transformer architecture, each token in the input attends to all other tokens (or a subset of them in more efficient variants) and calculates a weighted sum of their representations. These weights determine how much influence each token has on the representation of the current token being processed. This allows the model to capture long-range dependencies and understand the context within the input sequence. The context size determines the maximum number of preceding tokens that each token can attend to.
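To make the "weighted sum of representations" concrete, here is a minimal single-head causal attention sketch in PyTorch (names and shapes are illustrative, not tied to any particular model). Every extra token produced by the tokenizer adds a row and a column to the T×T weight matrix below, which is why bloated token sequences are costly:

```python
import torch
import torch.nn.functional as F

T, d = 6, 16                       # sequence length (in tokens), head dimension
q, k, v = (torch.randn(T, d) for _ in range(3))    # queries, keys, values (one per token)

scores = q @ k.T / d**0.5          # (T, T) pairwise affinities
mask = torch.tril(torch.ones(T, T)).bool()
scores = scores.masked_fill(~mask, float("-inf"))  # causal: each token attends only to the past
weights = F.softmax(scores, dim=-1)                # attention weights per token
out = weights @ v                  # weighted sum of value vectors, one per token
```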

GPT-2 and the Impact of its Tokenizer on Attention:

  • Longer Sequences for Non-English Languages: The GPT-2 tokenizer often breaks down non-English text into significantly more tokens than English for the same content. This "bloats up the sequence length". Since the attention mechanism has a finite context window (1,024 tokens for GPT-2), a longer sequence means the model can effectively attend to less of the original text's content. The model might run out of context, limiting its ability to understand dependencies in the non-English input.
  • Inefficient Whitespace Handling (Especially in Code like Python): GPT-2 treats each whitespace character as a separate token. In Python, where indentation is crucial, this leads to extremely long sequences of tokens for even short code snippets. This "wasteful tokenization" consumes valuable context length with semantically less important tokens (individual spaces). As a result, the model has a shorter effective context window for understanding the actual code logic and attending to relevant parts of the code.
  • Arbitrary Tokenization of Numbers: GPT-2's tokenizer can split numbers into tokens in seemingly arbitrary ways (e.g., "677" as two tokens, "127" as one). When a number is fragmented into multiple tokens, the model's attention mechanism has to work harder to attend to these individual parts and reconstruct the numerical value and its context before performing any reasoning or arithmetic.

GPT-4 and the Improved Impact of its Tokenizer on Attention:

  • Denser Representations and Increased Vocabulary: GPT-4's tokenizer has a significantly larger vocabulary size (roughly 100,000 tokens compared to GPT-2's roughly 50,000). This allows for denser encoding of text, meaning the same amount of information can often be represented with fewer tokens. This reduces the sequence length for a given input, allowing the attention mechanism to consider a larger span of the original text within its increased context window.
  • Improved Whitespace Handling (Especially in Code): The GPT-4 tokenizer is much more efficient in handling whitespace, particularly in code like Python. Multiple consecutive spaces are often grouped into a single token. This densifies the representation of code, preventing the excessive consumption of context length by individual space tokens. As a result, the attention mechanism can operate on a more semantically relevant sequence of tokens when processing code.
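The difference is easy to measure by tokenizing the same Python snippet with both tokenizers. A sketch assuming the tiktoken package, where "gpt2" is the GPT-2 encoding and "cl100k_base" is the GPT-4 encoding:

```python
import tiktoken

code = "def hello():\n    if True:\n        print('hi')\n"

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")

print("GPT-2 tokens:", len(gpt2.encode(code)))  # indentation spaces tend to become separate tokens
print("GPT-4 tokens:", len(gpt4.encode(code)))  # runs of spaces are merged into single tokens
```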

Is increasing the vocabulary size the way to go?

  • The embedding table gets larger, and the output prediction layer (softmax) grows as well.
  • This increases the number of trainable parameters in the model and the computational cost of predicting the next token.

Encoding for tokenizer

  • Strings are sequences of Unicode code points, which are integer representations of roughly 150,000 characters.
  • Simply using Unicode code points as tokens isn't ideal because the vocabulary would be too large, and the standard is constantly evolving.
  • Encodings like UTF-8, UTF-16, and UTF-32 are used to translate Unicode text into binary data.
    • UTF-8 is the most common encoding. It's a variable-length encoding, using 1 to 4 bytes per Unicode code point. It is also backwards compatible with ASCII.
    • While UTF-8 is preferred, using the raw bytes as tokens would result in a very small vocabulary (256) and extremely long token sequences, making processing inefficient for Transformers with limited context length.
    • Small vocabulary (256 tokens): Each possible byte value becomes a token.
    • Long token sequences: Many Unicode characters require multiple bytes in UTF-8, leading to more tokens per character or word compared to other tokenization methods.
    • Inefficient processing: Longer sequences consume the limited context length of Transformers quickly, restricting the amount of text the model can understand at once.
  • Example: Consider the simple word "你好" (hello in Chinese).
    • Conceptual Character-Level Tokenization: This would be represented by 2 tokens: ["你", "好"]
    • UTF-8 Encoding:
      • "你" is encoded in UTF-8 as the 3 bytes: E4 BD A0 (in hexadecimal).
      • "好" is encoded in UTF-8 as the 3 bytes: E5 A5 BD (in hexadecimal).
    • Raw Byte Tokenization: If each byte were a token, "你好" would be represented by 6 tokens (in their integer values).
    • In this simple example, what was 2 tokens at a conceptual character level becomes 6 tokens using raw bytes. For longer texts, this expansion in the number of tokens would be significant. A Transformer with a context length of, say, 1024 tokens could process much less text if it were tokenized into individual bytes compared to character-level or more advanced tokenization methods that group bytes into meaningful units.
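The same example in Python, showing how 2 code points become 6 byte-level tokens under UTF-8:

```python
text = "你好"
print(len(text))          # 2 Unicode code points
raw = text.encode("utf-8")
print(raw)                # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(list(raw))          # [228, 189, 160, 229, 165, 189] -> 6 byte-level tokens
```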

To recap: why do we need these encodings?

  • Large language models process numbers (tokens), not raw text.
  • Tokenization is the process of turning text (strings of characters) into sequences of tokens (integers).
  • Initially, a simple form of tokenization was character-level, where each character in the training data (like the Shakespeare dataset) became a token.
  • A vocabulary is the set of all possible tokens. In the character-level example, the vocabulary size was 65 (the unique characters in the Shakespeare data).
  • To feed these tokens into a language model, an embedding table is used to convert each integer token into a vector of trainable parameters.
  • We need to support more than just the English alphabet, including different languages and special characters like emojis.
  • In Python, strings are sequences of Unicode code points, which are integer identifiers for roughly 150,000 characters.
  • Using Unicode code points directly as tokens would lead to a very large and unstable vocabulary.
  • Encodings like UTF-8 are used to translate Unicode text into binary data (bytes).
  • UTF-8 is a common, variable-length encoding where each Unicode code point becomes 1 to 4 bytes. It's also compatible with ASCII.
  • While UTF-8 is preferred, using raw bytes (each byte as a token) would result in a very small vocabulary (256 tokens) but very long sequences, which is inefficient for Transformers with limited context length.
  • Solution: BPE

Byte Pair Encoding

Byte Pair Encoding (BPE) is a widely used subword tokenization technique that strikes a balance between word-based and character-based tokenization. It is particularly effective in handling rare words and is the foundation for tokenizers in models like GPT and BERT. Example: aaabdaaabac

  • The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:
ZabdZabac
Z=aa
  • Then the process is repeated with byte pair "ab", replacing it with "Y":
ZYdZYac
Y=ab
Z=aa
  • The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":
XdXac
X=ZY
Y=ab
Z=aa
  • This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
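A minimal sketch of the merge loop in Python (real tokenizers such as GPT-2's run it over UTF-8 bytes and record each merge in a vocabulary). Note that ties between equally frequent pairs, e.g. "Za" vs "ab" in the second step, may be broken in a different order than the hand-worked example above, but the idea is the same:

```python
from collections import Counter

def most_common_pair(seq):
    """Count adjacent pairs and return the most frequent one with its count."""
    return Counter(zip(seq, seq[1:])).most_common(1)[0]

def merge(seq, pair, new_symbol):
    """Replace every occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")
for new_symbol in ["Z", "Y", "X"]:
    pair, count = most_common_pair(seq)
    if count < 2:
        break                              # no pair repeats any more
    seq = merge(seq, pair, new_symbol)
    print(new_symbol, "=", "".join(pair), "->", "".join(seq))
# First merge: Z = aa, giving ZabdZabac (as above).
```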

How the Tokenizer Layer is Separate from the LLM Layer

  • The Tokenizer is a completely separate, independent module from the LLM.
  • It has its own training dataset of text (which could be different from that of the LLM), on which you train the vocabulary using the Byte Pair Encoding (BPE) algorithm.
  • It acts as a translator between the raw text coming from the web app (e.g., the ChatGPT front end) and the LLM, which takes tokens as input: it translates back and forth between raw text and sequences of tokens. The LLM only ever sees tokens and never directly deals with any text.
(Diagram: the tokenizer layer sits between raw text and the LLM.)

While training for the tokenizer:

  • You want to include as many types of data as possible. English text is usually well covered, but for a language like Hindi, the more Hindi text you add during tokenizer training, the more merges you get for Hindi characters, and the better the LLM will be able to keep that language within its context.
  • PS: if you get an error while decoding, use errors='replace' to handle byte sequences that are not valid UTF-8 (see the example below).
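Why errors="replace" matters: a byte-level vocabulary can emit byte sequences that are not valid UTF-8 on their own. A small illustration:

```python
tokens = bytes([228, 189])   # only 2 of the 3 UTF-8 bytes of "你"
# tokens.decode("utf-8") would raise UnicodeDecodeError
print(tokens.decode("utf-8", errors="replace"))   # prints the replacement character "�" instead
```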


To handle the addition of special tokens in a tokenizer:

  • Special tokens are used to delimit data or introduce structure. Examples include end-of-text or start/end of message tokens.
    • The tokenizer handles special tokens through special-case instructions, outside the Byte Pair Encoding (BPE) algorithm.
    • The code looks for these special strings and swaps them with their assigned token IDs.
  • Changes required in the model:
    • Embedding Layer: The token embedding matrix needs to be extended by adding a new row for each special token's embedding vector. These new embeddings are typically initialized with small random numbers.
    • Output Layer: The final linear layer (used for predicting the next token) needs to be extended to include the new special tokens in the output probability distribution.
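A hedged PyTorch sketch of those two changes, assuming the model exposes a token embedding (nn.Embedding) and an output head (nn.Linear); the names and the 0.02 init scale are illustrative:

```python
import torch
import torch.nn as nn

def extend_for_special_tokens(tok_emb: nn.Embedding, lm_head: nn.Linear, n_new: int):
    old_vocab, n_embd = tok_emb.weight.shape

    # Embedding layer: copy the old rows, initialise the new rows with small random numbers.
    new_emb = nn.Embedding(old_vocab + n_new, n_embd)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = tok_emb.weight
        new_emb.weight[old_vocab:].normal_(mean=0.0, std=0.02)

    # Output layer: one extra logit per special token in the next-token distribution.
    new_head = nn.Linear(n_embd, old_vocab + n_new, bias=False)
    with torch.no_grad():
        new_head.weight[:old_vocab] = lm_head.weight
        new_head.weight[old_vocab:].normal_(mean=0.0, std=0.02)

    return new_emb, new_head
```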

Sentencepiece

Commonly used because (unlike tiktoken) it can efficiently both train and run inference for BPE tokenizers. It is used in both the Llama and Mistral series.

sentencepiece: https://github.com/google/sentencepiece

The big difference: sentencepiece runs BPE on the Unicode code points directly! It then has a character_coverage option for what to do with very rare code points that appear only a few times: it either maps them onto an UNK token, or, if byte_fallback is turned on, it encodes them with UTF-8 and then encodes the raw bytes instead.

TLDR:

  • tiktoken encodes to utf-8 and then BPEs bytes
  • sentencepiece BPEs the code points and optionally falls back to utf-8 bytes for rare code points (rarity is determined by character_coverage hyperparameter), which then get translated to byte tokens.
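A sketch of training such a tokenizer with the sentencepiece Python API; the corpus path, vocab size, and coverage value are placeholders, and byte_fallback=True enables the UTF-8 byte fallback described above:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical training text, one sentence per line
    model_prefix="tok",          # writes tok.model and tok.vocab
    model_type="bpe",
    vocab_size=32000,
    character_coverage=0.9995,   # code points outside this coverage are treated as rare
    byte_fallback=True,          # encode uncovered code points as raw UTF-8 byte tokens
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
print(sp.encode("नमस्ते", out_type=str))  # uncovered characters show up as byte tokens like <0xE0>
```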

Design decisions

How to decide the vocab size?

  • reference: https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py
  • In the code for GPT, the vocab size shows up in 2 places - token_embedding_table (where we create the embedding of each token), and last lm_head layer (which is used to produce logits).
    • These logits become the probability distribution over the next token in the sequence. We are trying to produce the probability of every single token that might come next at every position in the transformer, so if we have more tokens, we need to produce more probabilities. Every additional token introduces an additional dot product in that final layer.
    • There is also the risk of under-training: the larger the vocabulary, the rarer each individual token's occurrence in the training data, which can leave the vectors/embeddings associated with those tokens under-trained.
  • As the vocabulary size increases, we can squish more content into the context window (a larger vocabulary means more merging), so we can fit more context. But this can be a good thing or a bad thing: too much information may be packed into a single token, and the forward pass may not be enough to process all of that information properly. See the sketch below for where vocab_size appears in the model.
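Following the structure of karpathy's gpt.py (the names token_embedding_table and lm_head come from that reference; the sizes here are illustrative), vocab_size appears in exactly these two places:

```python
import torch.nn as nn

vocab_size, n_embd = 50257, 768  # illustrative sizes

token_embedding_table = nn.Embedding(vocab_size, n_embd)  # one learned vector per token
lm_head = nn.Linear(n_embd, vocab_size)                   # one logit (dot product) per token

# Growing vocab_size adds rows to the embedding table and output logits to
# lm_head: more trainable parameters and a larger softmax at every step.
```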

What if we want to take a pre-trained model and extend the vocabulary?

  • When fine-tuning for a chat model like ChatGPT, many new special tokens are introduced on top of the base model to encode the metadata and structure of the conversation object between the user and the assistant.
  • We may want to add more special tokens for using the browser or any other tool.
  • So it's very tempting to add a lot of tokens for all kinds of special functionality.
  • To add new tokens, freeze the base model and train only the new parameters in these two layers (the token embedding and the output head); see the sketch after this list.
  • Example use-cases: Learning to compress prompts with gist tokens (link)
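One way to implement "freeze the base model, train only the new rows" in PyTorch, building on the earlier extension sketch (the gradient-hook approach is just one of several options, and the names are assumptions):

```python
import torch

def train_only_new_tokens(base_model, new_emb, new_head, old_vocab):
    # Freeze every parameter of the base model.
    for p in base_model.parameters():
        p.requires_grad = False
    # Re-enable gradients for the extended embedding and output head.
    for p in list(new_emb.parameters()) + list(new_head.parameters()):
        p.requires_grad = True

    # Zero out gradients for the original rows so only the new token rows are updated.
    def zero_old_rows(grad):
        grad = grad.clone()
        grad[:old_vocab] = 0
        return grad

    new_emb.weight.register_hook(zero_old_rows)
    new_head.weight.register_hook(zero_old_rows)
```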

How to construct transformers that can process not just text but also other modalities (e.g., images, video) simultaneously?

  • How do you feed in these modalities, and predict these modalities from a transformer?
  • Do we need to change the architecture in some fundamental way?
  • Mostly, people don't change the architecture: they just tokenize the new input domains and then pretend they are text tokens, doing everything else in an identical manner.
  • https://arxiv.org/pdf/2012.09841v3
  • https://openai.com/index/video-generation-models-as-world-simulators/ (sora: Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.)
    • They came up with a way to turn videos into tokens (visual patches) with their own vocabularies.

Attack tokens

DeepSeek

The video identifies four major things that contribute to DeepSeek's distinctiveness.

Innovative Architecture: DeepSeek employs several architectural innovations. These include:

  • Multi-Head Latent Attention: This involves a special type of key-value caching in a latent space, making the attention mechanism more efficient in terms of computation and space.
  • Mixture of Experts (MoE): Unlike traditional neural networks where the entire network is activated, MoE models activate only specific parts based on a routing mechanism, improving efficiency.
  • Multi-Token Prediction: Instead of predicting one token at a time, DeepSeek implemented a method to predict multiple tokens, potentially speeding up the process.
  • Quantization: DeepSeek uses quantization to represent parameters in a more compressed manner (like using a limited palette of colors in an image), reducing memory usage and potentially improving speed.
  • Rotary Positional Encodings (RoPE): an efficient way to encode the position of tokens by rotating the query and key vectors, avoiding polluting the embedding vector with additive position information (see the sketch below).
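A minimal sketch of rotary positional encodings in the common "rotate-half" formulation (DeepSeek's exact variant, used inside multi-head latent attention, differs in detail): each pair of dimensions in a query or key vector is rotated by an angle that depends on the token's position, instead of adding a position vector to the embedding.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, dim) with dim even; returns the position-rotated vectors."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)   # 8 tokens, head dimension 64 (illustrative)
q_rot = rope(q)          # apply the same rotation to keys before computing attention scores
```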

Creative and Innovative Training Methodology:

  • DeepSeek heavily utilizes large-scale reinforcement learning to teach complex reasoning to its models. A key aspect is the use of a rule-based reward system instead of relying on human-labeled data. This system is part of a framework called group relative policy optimization, which is central to their reinforcement learning approach and contributes to the reasoning capabilities of models like DeepSeek R1. DeepSeek's novel application of reinforcement learning with a rule-based reward system appears to be a significant differentiator.

Several GPU Optimization Tricks:

  • DeepSeek has implemented GPU optimization techniques, notably using Nvidia parallel thread execution (PTX) for some functions instead of the industry-standard CUDA. PTX is described as a lower-level intermediate step closer to machine code execution than CUDA, which can lead to significant speedups. This optimization contributes to the cost efficiency of their models.

A Model Ecosystem Favoring Distillation:

  • DeepSeek has a strategy of distilling knowledge from its very large models (like the 671 billion parameter DeepSeek version 3) into smaller models, even as small as 1.5 billion parameters. This allows for a range of models with different computational requirements and potentially different use cases.