Self-Attention: Token-by-Token Processing
Input Sequence:
We have T tokens in our sequence
Each token at position $t$ is denoted as $\mathbf{x}_t$, where $t \in \{1, 2, \ldots, T\}$
Each token embedding has dimension $d_{model}$
$$\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, ..., \mathbf{x}_T \quad \text{where each } \mathbf{x}_t \in \mathbb{R}^{d_{model}}$$
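For concreteness, here is a minimal sketch of what such a sequence looks like as an array; the sizes T = 5 and d_model = 8 are made up purely for illustration:

```python
import numpy as np

# Illustrative sizes only: 5 tokens, each embedded in 8 dimensions.
T, d_model = 5, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))   # row t holds the embedding x_t
print(X.shape)                      # (5, 8) -> (T, d_model)
```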
About T (Sequence Length)
T varies per input - Different sentences/sequences have different lengths
max_seq_len is the limit - This is the maximum value T can take
For GPT-2: max_seq_len = 1024
For GPT-3: max_seq_len = 2048
For GPT-4: max_seq_len = 8192, 32768, or 128k depending on version
For Claude: max_seq_len = 200k tokens
If your input has 50 tokens → T = 50
If your input has 500 tokens → T = 500
If you try to input 10,000 tokens but max_seq_len = 2048 → Error or truncation
Why does max_seq_len exist?
Positional encodings are pre-computed up to max_seq_len
Computational cost: attention is O(T²) in memory and time
The model was trained with sequences up to that length
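As a rough sketch of how the over-length case above might be handled in code, truncation is one common policy; the exact behavior depends on the tokenizer/library, and the limit and input length below are made up:

```python
# Hypothetical limit and input, for illustration only.
max_seq_len = 2048
token_ids = list(range(10_000))           # pretend this is a 10,000-token input

if len(token_ids) > max_seq_len:
    token_ids = token_ids[:max_seq_len]   # keep only the first max_seq_len tokens

T = len(token_ids)
print(T)                                  # 2048
```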
Self-attention uses three learned weight matrices:
$$\mathbf{W}_Q \in \mathbb{R}^{d_{model} \times d_k} \quad \text{(Query weight matrix)}$$
$$\mathbf{W}_K \in \mathbb{R}^{d_{model} \times d_k} \quad \text{(Key weight matrix)}$$
$$\mathbf{W}_V \in \mathbb{R}^{d_{model} \times d_v} \quad \text{(Value weight matrix)}$$
Note: In this single-head presentation we take $d_k = d_v = d_{model}$ for simplicity; this is not required, and in multi-head attention each head typically uses $d_k = d_v = d_{model}/h$, where $h$ is the number of heads.
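A minimal sketch of these three matrices in NumPy. In a real model they are learned parameters; here they are random placeholders with illustrative sizes:

```python
import numpy as np

d_model = 8                 # illustrative, not a real model size
d_k = d_v = d_model         # following the simplification above

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_k))   # query projection
W_K = rng.normal(size=(d_model, d_k))   # key projection
W_V = rng.normal(size=(d_model, d_v))   # value projection
```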
Step 1: Generate Q, K, V Vectors for Each Token
For each token $\mathbf{x}_t$, we compute three vectors by multiplying it with the weight matrices:
$$\mathbf{q}_1 = \mathbf{W}_Q^T \mathbf{x}_1 \quad \text{where } \mathbf{q}_1 \in \mathbb{R}^{d_k}$$
$$\mathbf{k}_1 = \mathbf{W}_K^T \mathbf{x}_1 \quad \text{where } \mathbf{k}_1 \in \mathbb{R}^{d_k}$$
$$\mathbf{v}_1 = \mathbf{W}_V^T \mathbf{x}_1 \quad \text{where } \mathbf{v}_1 \in \mathbb{R}^{d_v}$$
$$\mathbf{q}_2 = \mathbf{W}_Q^T \mathbf{x}_2$$
$$\mathbf{k}_2 = \mathbf{W}_K^T \mathbf{x}_2$$
$$\mathbf{v}_2 = \mathbf{W}_V^T \mathbf{x}_2$$
General form for the token at position $t$:
$$\mathbf{q}_t = \mathbf{W}_Q^T \mathbf{x}_t \in \mathbb{R}^{d_k}$$
$$\mathbf{k}_t = \mathbf{W}_K^T \mathbf{x}_t \in \mathbb{R}^{d_k}$$
$$\mathbf{v}_t = \mathbf{W}_V^T \mathbf{x}_t \in \mathbb{R}^{d_v}$$
Dimension check:
$$(d_k \times d_{model}) \cdot (d_{model} \times 1) = (d_k \times 1) \quad ✓$$
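A small sketch of this step for a single token, using random placeholder weights and the same column-vector convention as above (sizes are illustrative):

```python
import numpy as np

d_model, d_k, d_v = 8, 8, 8
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

x_t = rng.normal(size=(d_model, 1))  # one token embedding as a column vector

q_t = W_Q.T @ x_t                    # (d_k, d_model) @ (d_model, 1) -> (d_k, 1)
k_t = W_K.T @ x_t
v_t = W_V.T @ x_t
assert q_t.shape == (d_k, 1)         # matches the dimension check above
```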
Step 2: Compute Attention Scores
For each token position $i$, compute how much it should "attend to" every other token position $j$.
For the token at position $t=1$:
Compute the dot product of $\mathbf{q}_1$ with all key vectors:
$$\text{score}_{1,1} = \mathbf{q}_1^T \mathbf{k}_1$$
$$\text{score}_{1,2} = \mathbf{q}_1^T \mathbf{k}_2$$
$$\text{score}_{1,3} = \mathbf{q}_1^T \mathbf{k}_3$$
$$\vdots$$
$$\text{score}_{1,T} = \mathbf{q}_1^T \mathbf{k}_T$$
General form for the token at position $t$:
$$\text{score}_{t,j} = \mathbf{q}_t^T \mathbf{k}_j \quad \text{for } j = 1, 2, ..., T$$
This gives us a score vector for position $t$:
$$\mathbf{score}_t = [\text{score}_{t,1}, \text{score}_{t,2}, ..., \text{score}_{t,T}]^T \in \mathbb{R}^T$$
Step 3: Scale the Scores
To keep the softmax from saturating (and its gradients from vanishing), we divide each score by $\sqrt{d_k}$:
$$ \mathrm{scaled\_score}_{t,j} = \frac{\mathbf{q}_t^T \mathbf{k}_j}{\sqrt{d_k}} $$
Why scale? The magnitude of a dot product grows with the dimension $d_k$, so large raw scores push the softmax into saturation, where gradients are nearly zero. Dividing by $\sqrt{d_k}$ keeps the scores in a moderate range.
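A sketch of the score computation and scaling for one position, using random placeholder vectors (sizes are illustrative):

```python
import numpy as np

T, d_k = 5, 8
rng = np.random.default_rng(0)
q_t = rng.normal(size=(d_k,))          # query for position t
K = rng.normal(size=(T, d_k))          # row j holds the key k_j

scores = K @ q_t                       # score_{t,j} = q_t . k_j, shape (T,)
scaled_scores = scores / np.sqrt(d_k)  # keep magnitudes roughly independent of d_k
```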
Step 4: Apply Softmax (Get Attention Weights)
For each token position $t$, convert the scores into a probability distribution:
$$\alpha_{t,j} = \frac{\exp(\mathrm{scaled\_score}_{t,j})}{\sum_{i=1}^{T} \exp(\mathrm{scaled\_score}_{t,i})}$$
Where:
$\alpha_{t,j}$ = attention weight from token $t$ to token $j$
$\sum_{j=1}^{T} \alpha_{t,j} = 1$ (weights sum to 1)
Each $\alpha_{t,j} \in [0, 1]$
For token 1, the attention weights are:
$$\boldsymbol{\alpha}_1 = [\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, ..., \alpha_{1,T}]^T \in \mathbb{R}^T$$
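A minimal softmax sketch (the scores below are made up); subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(s):
    s = s - s.max()          # numerical stability; leaves the output unchanged
    e = np.exp(s)
    return e / e.sum()

scaled_scores = np.array([2.0, -1.0, 0.5, 0.0])   # hypothetical scores for one position
alpha = softmax(scaled_scores)
print(alpha, alpha.sum())    # each weight in [0, 1], and they sum to 1.0
```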
Step 5: Compute Output (Weighted Sum of Values)
For each token position $t$, the output is a weighted combination of all value vectors:
$$\mathbf{output}_t = \sum_{j=1}^{T} \alpha_{t,j} \cdot \mathbf{v}_j$$
Expanded:
$$\mathbf{output}_t = \alpha_{t,1} \mathbf{v}_1 + \alpha_{t,2} \mathbf{v}_2 + ... + \alpha_{t,T} \mathbf{v}_T$$
Where $\mathbf{output}_t \in \mathbb{R}^{d_v}$
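A sketch of the weighted sum for one position, with made-up attention weights and random value vectors:

```python
import numpy as np

T, d_v = 4, 8
rng = np.random.default_rng(0)
alpha_t = np.array([0.7, 0.1, 0.1, 0.1])  # hypothetical attention weights (sum to 1)
V = rng.normal(size=(T, d_v))             # row j holds the value vector v_j

output_t = alpha_t @ V                    # sum_j alpha_{t,j} * v_j, shape (d_v,)
```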
Complete Process: Token by Token
For token 1:
Compute Q, K, V:
$$\mathbf{q}_1 = \mathbf{W}_Q^T \mathbf{x}_1, \quad \mathbf{k}_1 = \mathbf{W}_K^T \mathbf{x}_1, \quad \mathbf{v}_1 = \mathbf{W}_V^T \mathbf{x}_1$$
Compute scores with all tokens:
$$s_{1,j} = \frac{\mathbf{q}_1^T \mathbf{k}_j}{\sqrt{d_k}} \quad \text{for } j=1,...,T$$
Apply softmax:
$$\alpha_{1,j} = \text{softmax}(\mathbf{s}_1)_j$$
Weighted sum: $$ \text{output}_1 = \sum_{j=1}^{T} \alpha_{1,j} \mathbf{v}_j $$
For token 2:
Compute Q, K, V:
$$\mathbf{q}_2 = \mathbf{W}_Q^T \mathbf{x}_2, \quad \mathbf{k}_2 = \mathbf{W}_K^T \mathbf{x}_2, \quad \mathbf{v}_2 = \mathbf{W}_V^T \mathbf{x}_2$$
Compute scores:
$$s_{2,j} = \frac{\mathbf{q}_2^T \mathbf{k}_j}{\sqrt{d_k}} \quad \text{for } j=1,...,T$$
Apply softmax:
$$\alpha_{2,j} = \text{softmax}(\mathbf{s}_2)_j$$
Weighted sum:
$$\text{output}_2 = \sum_{j=1}^{T} \alpha_{2,j} \mathbf{v}_j$$
In general, for any token $t$:
$$\boxed{
\begin{align}
\mathbf{q}_t &= \mathbf{W}_Q^T \mathbf{x}_t \\
\mathbf{k}_t &= \mathbf{W}_K^T \mathbf{x}_t \\
\mathbf{v}_t &= \mathbf{W}_V^T \mathbf{x}_t \\
s_{t,j} &= \frac{\mathbf{q}_t^T \mathbf{k}_j}{\sqrt{d_k}} \quad \forall j \in \{1,\dots,T\} \\
\alpha_{t,j} &= \text{softmax}(\mathbf{s}_t)_j \\
\mathbf{output}_t &= \sum_{j=1}^{T} \alpha_{t,j} \mathbf{v}_j
\end{align}
}$$
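Putting the boxed recipe together as a per-token loop. This is a sketch for a single attention head with random placeholder weights, not a production implementation:

```python
import numpy as np

def self_attention_per_token(X, W_Q, W_K, W_V):
    """Token-by-token self-attention following the boxed recipe above."""
    T = X.shape[0]
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # q_t, k_t, v_t stored as rows
    outputs = []
    for t in range(T):
        s_t = K @ Q[t] / np.sqrt(d_k)     # s_{t,j} for j = 1..T
        e = np.exp(s_t - s_t.max())
        alpha_t = e / e.sum()             # softmax -> attention weights
        outputs.append(alpha_t @ V)       # weighted sum of value vectors
    return np.stack(outputs)              # shape (T, d_v)

# Tiny usage example with illustrative sizes.
T, d_model = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention_per_token(X, W_Q, W_K, W_V).shape)   # (5, 8)
```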
Matrix Form (Processing All Tokens at Once)
In practice, we process all tokens in parallel using matrix operations:
Stack all embeddings into a matrix:
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_T^T \end{bmatrix} \in \mathbb{R}^{T \times d_{model}}$$
Compute Q, K, V matrices:
$$\mathbf{Q} = \mathbf{X} \mathbf{W}_Q \in \mathbb{R}^{T \times d_k} \quad \text{(each row is } \mathbf{q}_t^T)$$
$$\mathbf{K} = \mathbf{X} \mathbf{W}_K \in \mathbb{R}^{T \times d_k} \quad \text{(each row is } \mathbf{k}_t^T)$$
$$\mathbf{V} = \mathbf{X} \mathbf{W}_V \in \mathbb{R}^{T \times d_v} \quad \text{(each row is } \mathbf{v}_t^T)$$
Compute all attention scores at once:
$$\mathbf{S} = \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{T \times T}$$
Where $\mathbf{S}_{i,j} = \frac{\mathbf{q}_i^T \mathbf{k}_j}{\sqrt{d_k}}$
Apply the softmax to each row of $\mathbf{S}$:
$$\mathbf{A} = \text{softmax}(\mathbf{S}) \in \mathbb{R}^{T \times T}$$
Where $A_{i,j} = \alpha_{i,j}$ and each row sums to 1.
Finally, compute all outputs at once:
$$\mathbf{Output} = \mathbf{A} \mathbf{V} \in \mathbb{R}^{T \times d_v}$$
Where row $t$ of $\mathbf{Output}$ equals $\mathbf{output}_t^T$
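The same computation in matrix form, again as a sketch for a single head without masking. Its output should match the per-token loop above up to floating-point error (e.g., checked with np.allclose):

```python
import numpy as np

def self_attention_matrix(X, W_Q, W_K, W_V):
    """Vectorized scaled dot-product attention over all tokens at once."""
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (T, d_k), (T, d_k), (T, d_v)
    S = Q @ K.T / np.sqrt(d_k)                   # (T, T) score matrix
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)        # row-wise softmax
    return A @ V                                 # (T, d_v) stacked outputs
```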
Notation Summary

| Variable | Shape | Description |
|---|---|---|
| $T$ | scalar | Sequence length (varies per input, ≤ max_seq_len) |
| $d_{model}$ | scalar | Embedding dimension (e.g., 512, 768) |
| $d_k$ | scalar | Query/Key dimension (usually = $d_{model}$) |
| $d_v$ | scalar | Value dimension (usually = $d_{model}$) |
| $\mathbf{x}_t$ | $(d_{model}, 1)$ | Input token embedding at position $t$ |
| $\mathbf{W}_Q$ | $(d_{model}, d_k)$ | Query weight matrix |
| $\mathbf{W}_K$ | $(d_{model}, d_k)$ | Key weight matrix |
| $\mathbf{W}_V$ | $(d_{model}, d_v)$ | Value weight matrix |
| $\mathbf{q}_t$ | $(d_k, 1)$ | Query vector for token $t$ |
| $\mathbf{k}_t$ | $(d_k, 1)$ | Key vector for token $t$ |
| $\mathbf{v}_t$ | $(d_v, 1)$ | Value vector for token $t$ |
| $\text{score}_{t,j}$ | scalar | Raw attention score from $t$ to $j$ |
| $\alpha_{t,j}$ | scalar | Attention weight from $t$ to $j$ (after softmax) |
| $\mathbf{output}_t$ | $(d_v, 1)$ | Output vector for token $t$ |
| $\mathbf{X}$ | $(T, d_{model})$ | All input embeddings stacked |
| $\mathbf{Q}$ | $(T, d_k)$ | All query vectors stacked |
| $\mathbf{K}$ | $(T, d_k)$ | All key vectors stacked |
| $\mathbf{V}$ | $(T, d_v)$ | All value vectors stacked |
| $\mathbf{S}$ | $(T, T)$ | Score matrix (before softmax) |
| $\mathbf{A}$ | $(T, T)$ | Attention weight matrix (after softmax) |
| $\mathbf{Output}$ | $(T, d_v)$ | All output vectors stacked |
Key Takeaways
T varies - Different inputs have different sequence lengths, constrained by max_seq_len
Each token attends to ALL tokens - Token $t$ looks at all $T$ tokens (including itself)
Three projections - Same input creates different representations (Q, K, V)
Scaled dot-product - Similarity between query and keys determines attention
Softmax normalizes - Ensures attention weights sum to 1
Weighted aggregation - Output is a context-aware mixture of all value vectors