Self-Attention: Token-by-Token Processing

Setup and Notation

Input Sequence:

  • We have T tokens in our sequence
  • Each token at position $t$ is denoted as $\mathbf{x}_t$ where $t \in \{1, 2, ..., T\}$
  • Each token embedding has dimension $d_{model}$

$$\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, ..., \mathbf{x}_T \quad \text{where each } \mathbf{x}_t \in \mathbb{R}^{d_{model}}$$

About T (Sequence Length)

  • T varies per input - Different sentences/sequences have different lengths

  • max_seq_len is the limit - This is the maximum value T can take

    • For GPT-2: max_seq_len = 1024
    • For GPT-3: max_seq_len = 2048
    • For GPT-4: max_seq_len = 8192, 32768, or 128k depending on version
    • For Claude: max_seq_len = 200k tokens
  • If your input has 50 tokens → T = 50

  • If your input has 500 tokens → T = 500

  • If you try to input 10,000 tokens but max_seq_len = 2048 → Error or truncation

Why does max_seq_len exist?

  1. Positional encodings are pre-computed up to max_seq_len
  2. Computational cost: attention is O(T²) in memory and time
  3. The model was trained with sequences up to that length
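
A minimal sketch of how this plays out in practice (the token IDs, the `max_seq_len` value, and the left-to-right truncation policy below are illustrative assumptions, not tied to any particular model or tokenizer):

```python
# Toy example: clip an over-long token sequence to a hypothetical max_seq_len.
max_seq_len = 8
token_ids = list(range(12))              # pretend the tokenizer produced 12 tokens

if len(token_ids) > max_seq_len:
    token_ids = token_ids[:max_seq_len]  # simple left-to-right truncation

T = len(token_ids)                       # T is now at most max_seq_len
print(T)                                 # -> 8
```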

Weight Matrices

Self-attention uses three learned weight matrices:

$$\mathbf{W}_Q \in \mathbb{R}^{d_{model} \times d_k} \quad \text{(Query weight matrix)}$$

$$\mathbf{W}_K \in \mathbb{R}^{d_{model} \times d_k} \quad \text{(Key weight matrix)}$$

$$\mathbf{W}_V \in \mathbb{R}^{d_{model} \times d_v} \quad \text{(Value weight matrix)}$$

Note: In single-head attention, typically $d_k = d_v = d_{model}$; in multi-head attention each head usually uses $d_k = d_v = d_{model}/h$. Neither choice is strictly required.
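
As a concrete toy sketch, here is how these three matrices could be set up in NumPy. The dimensions and random initialization are illustrative assumptions; in a real model the matrices are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k, d_v = 8, 8, 8      # toy sizes; real models use e.g. 512 or 768

# Three independent projection matrices (randomly initialized here;
# in a trained model their values are learned).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))
```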


Step 1: Generate Q, K, V Vectors for Each Token

For each token $\mathbf{x}_t$, we compute three vectors by multiplying with the weight matrices:

Token at position $t=1$:

$$\mathbf{q}_1 = \mathbf{W}_Q^T \mathbf{x}_1 \quad \text{where } \mathbf{q}_1 \in \mathbb{R}^{d_k}$$

$$\mathbf{k}_1 = \mathbf{W}_K^T \mathbf{x}_1 \quad \text{where } \mathbf{k}_1 \in \mathbb{R}^{d_k}$$

$$\mathbf{v}_1 = \mathbf{W}_V^T \mathbf{x}_1 \quad \text{where } \mathbf{v}_1 \in \mathbb{R}^{d_v}$$

Token at position $t=2$:

$$\mathbf{q}_2 = \mathbf{W}_Q^T \mathbf{x}_2$$

$$\mathbf{k}_2 = \mathbf{W}_K^T \mathbf{x}_2$$

$$\mathbf{v}_2 = \mathbf{W}_V^T \mathbf{x}_2$$

General form for token at position $t$:

$$\mathbf{q}_t = \mathbf{W}_Q^T \mathbf{x}_t \in \mathbb{R}^{d_k}$$

$$\mathbf{k}_t = \mathbf{W}_K^T \mathbf{x}_t \in \mathbb{R}^{d_k}$$

$$\mathbf{v}_t = \mathbf{W}_V^T \mathbf{x}_t \in \mathbb{R}^{d_v}$$

Dimension check: $$(d_k \times d_{model}) \cdot (d_{model} \times 1) = (d_k \times 1) \quad ✓$$
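
Continuing the NumPy toy example from above (random embeddings and $T = 3$ are arbitrary illustrative choices), the per-token projections can be sketched as:

```python
T = 3                                  # toy sequence length
X = rng.normal(size=(T, d_model))      # row t is the embedding x_t

# q_t = W_Q^T x_t,  k_t = W_K^T x_t,  v_t = W_V^T x_t  for every token t
q = [W_Q.T @ X[t] for t in range(T)]   # each q[t] has shape (d_k,)
k = [W_K.T @ X[t] for t in range(T)]   # each k[t] has shape (d_k,)
v = [W_V.T @ X[t] for t in range(T)]   # each v[t] has shape (d_v,)

assert q[0].shape == (d_k,)            # matches the dimension check above
```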


Step 2: Compute Attention Scores

For each token position $t$, compute how much it should "attend to" every token position $j$ (including itself).

For token at position $t=1$:

Compute the dot product of $\mathbf{q}_1$ with all key vectors:

$$\text{score}_{1,1} = \mathbf{q}_1^T \mathbf{k}_1$$

$$\text{score}_{1,2} = \mathbf{q}_1^T \mathbf{k}_2$$

$$\text{score}_{1,3} = \mathbf{q}_1^T \mathbf{k}_3$$

$$\vdots$$

$$\text{score}_{1,T} = \mathbf{q}_1^T \mathbf{k}_T$$

General form for token at position $t$:

$$\text{score}_{t,j} = \mathbf{q}_t^T \mathbf{k}_j \quad \text{for } j = 1, 2, ..., T$$

This gives us a score vector for position $t$:

$$\mathbf{score}_t = [\text{score}_{t,1}, \text{score}_{t,2}, ..., \text{score}_{t,T}]^T \in \mathbb{R}^T$$
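
Continuing the same toy example, the score vector for a single query position $t$ is just a list of dot products (a clarity-first sketch, not an optimized implementation):

```python
# Raw (unscaled) attention scores of token t against every key position j.
t = 0
scores_t = np.array([q[t] @ k[j] for j in range(T)])   # shape (T,)
```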


Step 3: Scale the Scores

To keep the softmax from saturating (and its gradients from vanishing), we scale the scores by $\sqrt{d_k}$:

$$ \mathrm{scaled\_score}_{t,j} = \frac{\mathbf{q}_t^T \mathbf{k}_j}{\sqrt{d_k}} $$

Why scale? For roughly unit-variance queries and keys, the dot product $\mathbf{q}_t^T \mathbf{k}_j$ has variance on the order of $d_k$, so for large $d_k$ the raw scores become large and push the softmax into a saturated regime with near-zero gradients.
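
A quick numerical sanity check of this effect, using random unit-variance vectors (the dimension 512 below is an arbitrary illustrative choice):

```python
# Dot products of random unit-variance vectors grow roughly like sqrt(d_k).
d_big = 512
a, b = rng.normal(size=d_big), rng.normal(size=d_big)
print(a @ b)                      # typically tens in magnitude
print((a @ b) / np.sqrt(d_big))   # scaled back to O(1)

# Scale the toy scores from Step 2.
scaled_scores_t = scores_t / np.sqrt(d_k)
```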


Step 4: Apply Softmax (Get Attention Weights)

For each token position $t$, convert scores into a probability distribution:

$$\alpha_{t,j} = \frac{\exp(\mathrm{scaled\_score}_{t,j})}{\sum_{i=1}^{T} \exp(\mathrm{scaled\_score}_{t,i})}$$

Where:

  • $\alpha_{t,j}$ = attention weight from token $t$ to token $j$
  • $\sum_{j=1}^{T} \alpha_{t,j} = 1$ (weights sum to 1)
  • Each $\alpha_{t,j} \in [0, 1]$

For token 1, the attention weights are:

$$\boldsymbol{\alpha}_1 = [\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, ..., \alpha_{1,T}]^T \in \mathbb{R}^T$$
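
A minimal softmax helper for the toy example (subtracting the maximum is a standard numerical-stability trick and does not change the result):

```python
def softmax(s):
    e = np.exp(s - np.max(s))   # subtract the max for numerical stability
    return e / e.sum()

alpha_t = softmax(scaled_scores_t)       # shape (T,), entries in [0, 1]
assert np.isclose(alpha_t.sum(), 1.0)    # weights sum to 1
```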


Step 5: Compute Output (Weighted Sum of Values)

For each token position $t$, the output is a weighted combination of all value vectors:

$$\mathbf{output}_t = \sum_{j=1}^{T} \alpha_{t,j} \cdot \mathbf{v}_j$$

Expanded:

$$\mathbf{output}_t = \alpha_{t,1} \mathbf{v}_1 + \alpha_{t,2} \mathbf{v}_2 + ... + \alpha_{t,T} \mathbf{v}_T$$

Where $\mathbf{output}_t \in \mathbb{R}^{d_v}$
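
In the toy example, the weighted sum is one line:

```python
# output_t = sum_j alpha_{t,j} * v_j
output_t = sum(alpha_t[j] * v[j] for j in range(T))   # shape (d_v,)
```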


Complete Process: Token by Token

For $t=1$:

  1. Compute Q, K, V: $$\mathbf{q}_1 = \mathbf{W}_Q^T \mathbf{x}_1, \quad \mathbf{k}_1 = \mathbf{W}_K^T \mathbf{x}_1, \quad \mathbf{v}_1 = \mathbf{W}_V^T \mathbf{x}_1$$

  2. Compute scores with all tokens: $$s_{1,j} = \frac{\mathbf{q}_1^T \mathbf{k}_j}{\sqrt{d_k}} \quad \text{for } j=1,...,T$$

  3. Apply softmax: $$\alpha_{1,j} = \text{softmax}(\mathbf{s}_1)_j$$

  4. Weighted sum: $$ \text{output}_1 = \sum_{j=1}^{T} \alpha_{1,j} \mathbf{v}_j $$

For $t=2$:

  1. Compute Q, K, V: $$\mathbf{q}_2 = \mathbf{W}_Q^T \mathbf{x}_2, \quad \mathbf{k}_2 = \mathbf{W}_K^T \mathbf{x}_2, \quad \mathbf{v}_2 = \mathbf{W}_V^T \mathbf{x}_2$$

  2. Compute scores: $$s_{2,j} = \frac{\mathbf{q}_2^T \mathbf{k}_j}{\sqrt{d_k}} \quad \text{for } j=1,...,T$$

  3. Apply softmax: $$\alpha_{2,j} = \text{softmax}(\mathbf{s}_2)_j$$

  4. Weighted sum: $$\text{output}_2 = \sum_{j=1}^{T} \alpha_{2,j} \mathbf{v}_j$$

For general token $t$:

$$\boxed{ \begin{aligned} \mathbf{q}_t &= \mathbf{W}_Q^T \mathbf{x}_t \\ \mathbf{k}_t &= \mathbf{W}_K^T \mathbf{x}_t \\ \mathbf{v}_t &= \mathbf{W}_V^T \mathbf{x}_t \\ s_{t,j} &= \frac{\mathbf{q}_t^T \mathbf{k}_j}{\sqrt{d_k}} \quad \forall j \in \{1,...,T\} \\ \alpha_{t,j} &= \text{softmax}(\mathbf{s}_t)_j \\ \mathbf{output}_t &= \sum_{j=1}^{T} \alpha_{t,j} \mathbf{v}_j \end{aligned} }$$
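
Putting the per-token recipe together as one function (a clarity-first sketch that reuses the `softmax` helper from Step 4; the function and variable names are illustrative, not from any library):

```python
def self_attention_token_by_token(X, W_Q, W_K, W_V):
    """Self-attention computed one query token at a time (for clarity, not speed)."""
    T = X.shape[0]
    d_k = W_Q.shape[1]
    Q = X @ W_Q                           # row t is q_t^T
    K = X @ W_K                           # row t is k_t^T
    V = X @ W_V                           # row t is v_t^T

    outputs = np.zeros((T, W_V.shape[1]))
    for t in range(T):
        s_t = (K @ Q[t]) / np.sqrt(d_k)   # scores of token t against all keys
        alpha_t = softmax(s_t)            # attention weights, sum to 1
        outputs[t] = alpha_t @ V          # weighted sum of all value vectors
    return outputs
```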


Matrix Form (Processing All Tokens at Once)

In practice, we process all tokens in parallel using matrix operations:

Stack all embeddings into a matrix:

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_T^T \end{bmatrix} \in \mathbb{R}^{T \times d_{model}}$$

Compute Q, K, V matrices:

$$\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{T \times d_k} \quad \text{(each row is } \mathbf{q}_t^T)$$

$$\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{T \times d_k} \quad \text{(each row is } \mathbf{k}_t^T)$$

$$\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{T \times d_v} \quad \text{(each row is } \mathbf{v}_t^T)$$

Compute all attention scores at once:

$$\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{T \times T}$$

Where $\mathbf{S}_{i,j} = \frac{\mathbf{q}_i^T \mathbf{k}_j}{\sqrt{d_k}}$

Apply softmax row-wise:

$$\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{T \times T}$$

Where $A_{i,j} = \alpha_{i,j}$ and each row sums to 1.

Compute all outputs:

$$\mathbf{Output} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{T \times d_v}$$

Where row $t$ of $\mathbf{Output}$ equals $\mathbf{output}_t^T$
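
The same computation, vectorized to match the matrix form above (a sketch on the toy data; the final check simply confirms it agrees with the token-by-token loop defined earlier):

```python
def self_attention(X, W_Q, W_K, W_V):
    """Vectorized scaled dot-product self-attention over all tokens at once."""
    d_k = W_Q.shape[1]
    Q = X @ W_Q                                    # (T, d_k)
    K = X @ W_K                                    # (T, d_k)
    V = X @ W_V                                    # (T, d_v)
    S = Q @ K.T / np.sqrt(d_k)                     # (T, T) score matrix
    A = np.exp(S - S.max(axis=1, keepdims=True))   # row-wise softmax ...
    A = A / A.sum(axis=1, keepdims=True)           # ... each row sums to 1
    return A @ V                                   # (T, d_v)

# The two implementations agree up to floating-point error.
assert np.allclose(self_attention(X, W_Q, W_K, W_V),
                   self_attention_token_by_token(X, W_Q, W_K, W_V))
```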


Summary of Dimensions

| Variable | Shape | Description |
|---|---|---|
| $T$ | scalar | Sequence length (varies per input, ≤ max_seq_len) |
| $d_{model}$ | scalar | Embedding dimension (e.g., 512, 768) |
| $d_k$ | scalar | Query/Key dimension (usually = $d_{model}$) |
| $d_v$ | scalar | Value dimension (usually = $d_{model}$) |
| $\mathbf{x}_t$ | $(d_{model}, 1)$ | Input token embedding at position $t$ |
| $\mathbf{W}_Q$ | $(d_{model}, d_k)$ | Query weight matrix |
| $\mathbf{W}_K$ | $(d_{model}, d_k)$ | Key weight matrix |
| $\mathbf{W}_V$ | $(d_{model}, d_v)$ | Value weight matrix |
| $\mathbf{q}_t$ | $(d_k, 1)$ | Query vector for token $t$ |
| $\mathbf{k}_t$ | $(d_k, 1)$ | Key vector for token $t$ |
| $\mathbf{v}_t$ | $(d_v, 1)$ | Value vector for token $t$ |
| $\text{score}_{t,j}$ | scalar | Raw attention score from $t$ to $j$ |
| $\alpha_{t,j}$ | scalar | Attention weight from $t$ to $j$ (after softmax) |
| $\mathbf{output}_t$ | $(d_v, 1)$ | Output vector for token $t$ |
| $\mathbf{X}$ | $(T, d_{model})$ | All input embeddings stacked |
| $\mathbf{Q}$ | $(T, d_k)$ | All query vectors stacked |
| $\mathbf{K}$ | $(T, d_k)$ | All key vectors stacked |
| $\mathbf{V}$ | $(T, d_v)$ | All value vectors stacked |
| $\mathbf{S}$ | $(T, T)$ | Score matrix (before softmax) |
| $\mathbf{A}$ | $(T, T)$ | Attention weight matrix (after softmax) |
| $\mathbf{Output}$ | $(T, d_v)$ | All output vectors stacked |

Key Takeaways

  1. T varies - Different inputs have different sequence lengths, constrained by max_seq_len
  2. Each token attends to ALL tokens - Token $t$ looks at all $T$ tokens (including itself)
  3. Three projections - Same input creates different representations (Q, K, V)
  4. Scaled dot-product - Similarity between query and keys determines attention
  5. Softmax normalizes - Ensures attention weights sum to 1
  6. Weighted aggregation - Output is a context-aware mixture of all value vectors