Self-Attention: Token-by-Token Processing
Input Sequence:
We have T tokens in our sequence
Each token at position $t$ is denoted as $\mathbf{x}_t$, where $t \in \{1, 2, \ldots, T\}$
Each token embedding has dimension $d_{model}$
$$\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, ..., \mathbf{x}_T \quad \text{where each } \mathbf{x}_t \in \mathbb{R}^{d_{model}}$$
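For concreteness, here is a minimal sketch of what such a sequence looks like as an array; the sizes T = 5 and d_model = 8 are made up purely for illustration:

```python
import numpy as np

# Illustrative sizes only: 5 tokens, each embedded in 8 dimensions.
T, d_model = 5, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))   # row t holds the embedding x_t
print(X.shape)                      # (5, 8) -> (T, d_model)
```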
About T (Sequence Length)
T varies per input - Different sentences/sequences have different lengths
max_seq_len is the limit - This is the maximum value T can take
For GPT-2: max_seq_len = 1024
For GPT-3: max_seq_len = 2048
For GPT-4: max_seq_len = 8192, 32768, or 128k depending on version
For Claude: max_seq_len = 200k tokens
If your input has 50 tokens → T = 50
If your input has 500 tokens → T = 500
If you try to input 10,000 tokens but max_seq_len = 2048 → Error or truncation
Why does max_seq_len exist?
Positional encodings are pre-computed up to max_seq_len
Computational cost: attention is O(T²) in memory and time
The model was trained with sequences up to that length
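As a rough sketch of how the over-length case above might be handled in code, truncation is one common policy; the exact behavior depends on the tokenizer/library, and the limit and input length below are made up:

```python
# Hypothetical limit and input, for illustration only.
max_seq_len = 2048
token_ids = list(range(10_000))           # pretend this is a 10,000-token input

if len(token_ids) > max_seq_len:
    token_ids = token_ids[:max_seq_len]   # keep only the first max_seq_len tokens

T = len(token_ids)
print(T)                                  # 2048
```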
Self-attention uses three learned weight matrices:
$$\mathbf{W}_Q \in \mathbb{R}^{d_{model} \times d_k} \quad \text{(Query weight matrix)}$$
$$\mathbf{W}_K \in \mathbb{R}^{d_{model} \times d_k} \quad \text{(Key weight matrix)}$$
$$\mathbf{W}_V \in \mathbb{R}^{d_{model} \times d_v} \quad \text{(Value weight matrix)}$$
Note: In this single-head presentation we take $d_k = d_v = d_{model}$ for simplicity; this is not required, and in multi-head attention each head typically uses $d_k = d_v = d_{model}/h$, where $h$ is the number of heads.
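A minimal sketch of these three matrices in NumPy. In a real model they are learned parameters; here they are random placeholders with illustrative sizes:

```python
import numpy as np

d_model = 8                 # illustrative, not a real model size
d_k = d_v = d_model         # following the simplification above

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_k))   # query projection
W_K = rng.normal(size=(d_model, d_k))   # key projection
W_V = rng.normal(size=(d_model, d_v))   # value projection
```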
Step 1: Generate Q, K, V Vectors for Each Token
For each token $\mathbf{x}_t$, we compute three vectors by multiplying it with the weight matrices:
$$\mathbf{q}_1 = \mathbf{W}_Q^T \mathbf{x}_1 \quad \text{where } \mathbf{q}_1 \in \mathbb{R}^{d_k}$$
$$\mathbf{k}_1 = \mathbf{W}_K^T \mathbf{x}_1 \quad \text{where } \mathbf{k}_1 \in \mathbb{R}^{d_k}$$
$$\mathbf{v}_1 = \mathbf{W}_V^T \mathbf{x}_1 \quad \text{where } \mathbf{v}_1 \in \mathbb{R}^{d_v}$$
$$\mathbf{q}_2 = \mathbf{W}_Q^T \mathbf{x}_2$$
$$\mathbf{k}_2 = \mathbf{W}_K^T \mathbf{x}_2$$
$$\mathbf{v}_2 = \mathbf{W}_V^T \mathbf{x}_2$$
General form for the token at position $t$:
$$\mathbf{q}_t = \mathbf{W}_Q^T \mathbf{x}_t \in \mathbb{R}^{d_k}$$
$$\mathbf{k}_t = \mathbf{W}_K^T \mathbf{x}_t \in \mathbb{R}^{d_k}$$
$$\mathbf{v}_t = \mathbf{W}_V^T \mathbf{x}_t \in \mathbb{R}^{d_v}$$
Dimension check:
$$(d_k \times d_{model}) \cdot (d_{model} \times 1) = (d_k \times 1) \quad ✓$$
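A small sketch of this step for a single token, using random placeholder weights and the same column-vector convention as above (sizes are illustrative):

```python
import numpy as np

d_model, d_k, d_v = 8, 8, 8
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

x_t = rng.normal(size=(d_model, 1))  # one token embedding as a column vector

q_t = W_Q.T @ x_t                    # (d_k, d_model) @ (d_model, 1) -> (d_k, 1)
k_t = W_K.T @ x_t
v_t = W_V.T @ x_t
assert q_t.shape == (d_k, 1)         # matches the dimension check above
```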
Step 2: Compute Attention Scores
For each token position $i$, compute how much it should "attend to" every other token position $j$.
For the token at position $t=1$:
Compute the dot product of $\mathbf{q}_1$ with all key vectors:
$$\text{score}_{1,1} = \mathbf{q}_1^T \mathbf{k}_1$$
$$\text{score}_{1,2} = \mathbf{q}_1^T \mathbf{k}_2$$
$$\text{score}_{1,3} = \mathbf{q}_1^T \mathbf{k}_3$$
$$\vdots$$
$$\text{score}_{1,T} = \mathbf{q}_1^T \mathbf{k}_T$$
General form for the token at position $t$:
$$\text{score}_{t,j} = \mathbf{q}_t^T \mathbf{k}_j \quad \text{for } j = 1, 2, ..., T$$
This gives us a score vector for position $t$:
$$\mathbf{score}_t = [\text{score}_{t,1}, \text{score}_{t,2}, ..., \text{score}_{t,T}]^T \in \mathbb{R}^T$$
Step 3: Scale the Scores
To keep the softmax from saturating (and its gradients from vanishing), we divide each score by $\sqrt{d_k}$:
$$ \mathrm{scaled\_score}_{t,j} = \frac{\mathbf{q}_t^T \mathbf{k}_j}{\sqrt{d_k}} $$
Why scale? The magnitude of a dot product grows with the dimension $d_k$, so large raw scores push the softmax into saturation, where gradients are nearly zero. Dividing by $\sqrt{d_k}$ keeps the scores in a moderate range.
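A sketch of the score computation and scaling for one position, using random placeholder vectors (sizes are illustrative):

```python
import numpy as np

T, d_k = 5, 8
rng = np.random.default_rng(0)
q_t = rng.normal(size=(d_k,))          # query for position t
K = rng.normal(size=(T, d_k))          # row j holds the key k_j

scores = K @ q_t                       # score_{t,j} = q_t . k_j, shape (T,)
scaled_scores = scores / np.sqrt(d_k)  # keep magnitudes roughly independent of d_k
```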
Step 4: Apply Softmax (Get Attention Weights)
For each token position $t$, convert the scores into a probability distribution:
$$\alpha_{t,j} = \frac{\exp(\mathrm{scaled\_score}_{t,j})}{\sum_{i=1}^{T} \exp(\mathrm{scaled\_score}_{t,i})}$$
Where:
$\alpha_{t,j}$ = attention weight from token $t$ to token $j$
$\sum_{j=1}^{T} \alpha_{t,j} = 1$ (weights sum to 1)
Each $\alpha_{t,j} \in [0, 1]$
For token 1, the attention weights are:
$$\boldsymbol{\alpha}_1 = [\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, ..., \alpha_{1,T}]^T \in \mathbb{R}^T$$
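A minimal softmax sketch (the scores below are made up); subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(s):
    s = s - s.max()          # numerical stability; leaves the output unchanged
    e = np.exp(s)
    return e / e.sum()

scaled_scores = np.array([2.0, -1.0, 0.5, 0.0])   # hypothetical scores for one position
alpha = softmax(scaled_scores)
print(alpha, alpha.sum())    # each weight in [0, 1], and they sum to 1.0
```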
Step 5: Compute Output (Weighted Sum of Values)
For each token position $t$, the output is a weighted combination of all value vectors:
$$\mathbf{output}_t = \sum_{j=1}^{T} \alpha_{t,j} \cdot \mathbf{v}_j$$
Expanded:
$$\mathbf{output}_t = \alpha_{t,1} \mathbf{v}_1 + \alpha_{t,2} \mathbf{v}_2 + ... + \alpha_{t,T} \mathbf{v}_T$$
Where $\mathbf{output}_t \in \mathbb{R}^{d_v}$
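A sketch of the weighted sum for one position, with made-up attention weights and random value vectors:

```python
import numpy as np

T, d_v = 4, 8
rng = np.random.default_rng(0)
alpha_t = np.array([0.7, 0.1, 0.1, 0.1])  # hypothetical attention weights (sum to 1)
V = rng.normal(size=(T, d_v))             # row j holds the value vector v_j

output_t = alpha_t @ V                    # sum_j alpha_{t,j} * v_j, shape (d_v,)
```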
Complete Process: Token by Token
For token 1:
Compute Q, K, V:
$$\mathbf{q}_1 = \mathbf{W}_Q^T \mathbf{x}_1, \quad \mathbf{k}_1 = \mathbf{W}_K^T \mathbf{x}_1, \quad \mathbf{v}_1 = \mathbf{W}_V^T \mathbf{x}_1$$
Compute scores with all tokens:
$$s_{1,j} = \frac{\mathbf{q}_1^T \mathbf{k}_j}{\sqrt{d_k}} \quad \text{for } j=1,...,T$$
Apply softmax:
$$\alpha_{1,j} = \text{softmax}(\mathbf{s}_1)_j$$
Weighted sum: $$ \text{output}_1 = \sum_{j=1}^{T} \alpha_{1,j} \mathbf{v}_j $$
For token 2:
Compute Q, K, V:
$$\mathbf{q}_2 = \mathbf{W}_Q^T \mathbf{x}_2, \quad \mathbf{k}_2 = \mathbf{W}_K^T \mathbf{x}_2, \quad \mathbf{v}_2 = \mathbf{W}_V^T \mathbf{x}_2$$
Compute scores:
$$s_{2,j} = \frac{\mathbf{q}_2^T \mathbf{k}_j}{\sqrt{d_k}} \quad \text{for } j=1,...,T$$
Apply softmax:
$$\alpha_{2,j} = \text{softmax}(\mathbf{s}_2)_j$$
Weighted sum:
$$\text{output}_2 = \sum_{j=1}^{T} \alpha_{2,j} \mathbf{v}_j$$
In general, for any token $t$:
$$\boxed{
\begin{align}
\mathbf{q}_t &= \mathbf{W}_Q^T \mathbf{x}_t \\
\mathbf{k}_t &= \mathbf{W}_K^T \mathbf{x}_t \\
\mathbf{v}_t &= \mathbf{W}_V^T \mathbf{x}_t \\
s_{t,j} &= \frac{\mathbf{q}_t^T \mathbf{k}_j}{\sqrt{d_k}} \quad \forall j \in \{1,\dots,T\} \\
\alpha_{t,j} &= \text{softmax}(\mathbf{s}_t)_j \\
\mathbf{output}_t &= \sum_{j=1}^{T} \alpha_{t,j} \mathbf{v}_j
\end{align}
}$$
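Putting the boxed recipe together as a per-token loop. This is a sketch for a single attention head with random placeholder weights, not a production implementation:

```python
import numpy as np

def self_attention_per_token(X, W_Q, W_K, W_V):
    """Token-by-token self-attention following the boxed recipe above."""
    T = X.shape[0]
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # q_t, k_t, v_t stored as rows
    outputs = []
    for t in range(T):
        s_t = K @ Q[t] / np.sqrt(d_k)     # s_{t,j} for j = 1..T
        e = np.exp(s_t - s_t.max())
        alpha_t = e / e.sum()             # softmax -> attention weights
        outputs.append(alpha_t @ V)       # weighted sum of value vectors
    return np.stack(outputs)              # shape (T, d_v)

# Tiny usage example with illustrative sizes.
T, d_model = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention_per_token(X, W_Q, W_K, W_V).shape)   # (5, 8)
```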
Matrix Form (Processing All Tokens at Once)
In practice, we process all tokens in parallel using matrix operations:
Stack all embeddings into a matrix:
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_T^T \end{bmatrix} \in \mathbb{R}^{T \times d_{model}}$$
Compute Q, K, V matrices:
$$\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{T \times d_k} \quad \text{(each row is } \mathbf{q}_t^T)$$
$$\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{T \times d_k} \quad \text{(each row is } \mathbf{k}_t^T)$$
$$\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{T \times d_v} \quad \text{(each row is } \mathbf{v}_t^T)$$
Compute all attention scores at once:
$$\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{T \times T}$$
Where $\mathbf{S}_{i,j} = \frac{\mathbf{q}_i^T \mathbf{k}_j}{\sqrt{d_k}}$
Apply the softmax to each row of $\mathbf{S}$:
$$\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{T \times T}$$
Where $A_{i,j} = \alpha_{i,j}$ and each row sums to 1.
Finally, compute all outputs at once:
$$\mathbf{Output} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{T \times d_v}$$
Where row $t$ of $\mathbf{Output}$ equals $\mathbf{output}_t^T$
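The same computation in matrix form, again as a sketch for a single head without masking. Its output should match the per-token loop above up to floating-point error (e.g., checked with np.allclose):

```python
import numpy as np

def self_attention_matrix(X, W_Q, W_K, W_V):
    """Vectorized scaled dot-product attention over all tokens at once."""
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (T, d_k), (T, d_k), (T, d_v)
    S = Q @ K.T / np.sqrt(d_k)                   # (T, T) score matrix
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)        # row-wise softmax
    return A @ V                                 # (T, d_v) stacked outputs
```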
Notation Summary

| Variable | Shape | Description |
|---|---|---|
| $T$ | scalar | Sequence length (varies per input, ≤ max_seq_len) |
| $d_{model}$ | scalar | Embedding dimension (e.g., 512, 768) |
| $d_k$ | scalar | Query/Key dimension (usually = $d_{model}$) |
| $d_v$ | scalar | Value dimension (usually = $d_{model}$) |
| $\mathbf{x}_t$ | $(d_{model}, 1)$ | Input token embedding at position $t$ |
| $\mathbf{W}_Q$ | $(d_{model}, d_k)$ | Query weight matrix |
| $\mathbf{W}_K$ | $(d_{model}, d_k)$ | Key weight matrix |
| $\mathbf{W}_V$ | $(d_{model}, d_v)$ | Value weight matrix |
| $\mathbf{q}_t$ | $(d_k, 1)$ | Query vector for token $t$ |
| $\mathbf{k}_t$ | $(d_k, 1)$ | Key vector for token $t$ |
| $\mathbf{v}_t$ | $(d_v, 1)$ | Value vector for token $t$ |
| $\text{score}_{t,j}$ | scalar | Raw attention score from $t$ to $j$ |
| $\alpha_{t,j}$ | scalar | Attention weight from $t$ to $j$ (after softmax) |
| $\mathbf{output}_t$ | $(d_v, 1)$ | Output vector for token $t$ |
| $\mathbf{X}$ | $(T, d_{model})$ | All input embeddings stacked |
| $\mathbf{Q}$ | $(T, d_k)$ | All query vectors stacked |
| $\mathbf{K}$ | $(T, d_k)$ | All key vectors stacked |
| $\mathbf{V}$ | $(T, d_v)$ | All value vectors stacked |
| $\mathbf{S}$ | $(T, T)$ | Score matrix (before softmax) |
| $\mathbf{A}$ | $(T, T)$ | Attention weight matrix (after softmax) |
| $\mathbf{Output}$ | $(T, d_v)$ | All output vectors stacked |
Key Takeaways
T varies - Different inputs have different sequence lengths, constrained by max_seq_len
Each token attends to ALL tokens - Token $t$ looks at all $T$ tokens (including itself)
Three projections - Same input creates different representations (Q, K, V)
Scaled dot-product - Similarity between query and keys determines attention
Softmax normalizes - Ensures attention weights sum to 1
Weighted aggregation - Output is a context-aware mixture of all value vectors