
How to write MLA as MHA

Terminology:

@ := Matrix Multiplication

Example: A @ B = matmul(A, B)

.T := Transpose

Example: A.T = transpose(A)

W prefix := Constant Matrix (Weights for Model)

Wa @ Wb is also constant (so products of weights can be precomputed once)

X := Input to attention layer

MHA (Multi-Head Attention)

Q = X @ WQ
K = X @ WK
V = X @ WV

att = softmax(Q @ K.T) @ V

result = att @ WO
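
Below is a minimal runnable NumPy sketch of this block, assuming a single head and omitting softmax scaling to match the pseudocode; all sizes and the random weights are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, h = 4, 8, 4             # sequence length, model dim, head dim
X  = rng.normal(size=(n, d))  # input to the attention layer
WQ = rng.normal(size=(d, h))
WK = rng.normal(size=(d, h))
WV = rng.normal(size=(d, h))
WO = rng.normal(size=(h, d))

Q = X @ WQ
K = X @ WK
V = X @ WV

att = softmax(Q @ K.T) @ V    # (n, h)
result = att @ WO             # (n, d)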

MLA (Multi-head Latent Attention)

Q = X @ WaQ  @ WbQ
K = X @ WaKV @ WbK
V = X @ WaKV @ WbV

att = softmax(Q @ K.T) @ V

result = att @ WO
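
The same sketch for MLA, reusing X, softmax, and rng from above; the latent dims dq and dkv are illustrative (the point is that they are smaller than d):

dq, dkv = 6, 5                     # query / shared-KV latent dims
WaQ  = rng.normal(size=(d, dq))    # query down-projection
WbQ  = rng.normal(size=(dq, h))    # query up-projection
WaKV = rng.normal(size=(d, dkv))   # shared KV down-projection
WbK  = rng.normal(size=(dkv, h))   # key up-projection
WbV  = rng.normal(size=(dkv, h))   # value up-projection
WO   = rng.normal(size=(h, d))

Q = X @ WaQ @ WbQ
K = X @ WaKV @ WbK
V = X @ WaKV @ WbV

att = softmax(Q @ K.T) @ V
result_naive = att @ WO            # kept to check the optimizations below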

Optimization 1: Absorb WbV into WO

Q = X @ WaQ  @ WbQ
K = X @ WaKV @ WbK
V = X @ WaKV

att = softmax(Q @ K.T) @ V

result = att @ (WbV @ WO)
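
Continuing the NumPy sketch, a quick check that absorbing WbV into WO changes nothing; WbV @ WO is a product of constants, so it can be precomputed once:

V_latent = X @ WaKV                  # V stays in the latent space
att1 = softmax(Q @ K.T) @ V_latent
result1 = att1 @ (WbV @ WO)          # absorbed output projection

assert np.allclose(result1, result_naive)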

Optimization 2: Absorb WbK into WbQ

Q @ K.T = (X @ WaQ @ WbQ) @ (X @ WaKV @ WbK).T

Q @ K.T = (X @ WaQ @ WbQ) @ (WbK.T @ WaKV.T @ X.T)

Q @ K.T = (X @ WaQ @ WbQ @ WbK.T) @ (WaKV.T @ X.T)

Q @ K.T = (X @ WaQ @ WbQ @ WbK.T) @ (X @ WaKV).T

newQ = X @ WaQ @ WbQ @ WbK.T
newK = X @ WaKV
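
Again in the sketch, a check that the absorbed query reproduces the original scores; WbQ @ WbK.T is constant and can be folded offline:

newQ = X @ WaQ @ WbQ @ WbK.T
newK = X @ WaKV

assert np.allclose(newQ @ newK.T, Q @ K.T)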

MLA as MHA with Optimization 2

Q = X @ WaQ @ WbQ @ WbK.T
K = X @ WaKV
V = X @ WaKV

att = softmax(Q @ K.T) @ V

result = att @ (WbV @ WO)
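
Both optimizations together, checked against the naive MLA result from the sketch; newK (= X @ WaKV) is the single cached latent, used as both K and V:

att2 = softmax(newQ @ newK.T) @ newK
result2 = att2 @ (WbV @ WO)

assert np.allclose(result2, result_naive)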

What about RoPE?

RoPE completely blocks Optimization 2: the rotation depends on position and sits between WbQ and WbK.T, so the two weights can no longer be folded into one constant matrix, which would render the latent compression useless. So instead of applying RoPE inside the compressed path, add extra heads that carry the RoPE computation:

Q = X @ WaQ @ WbQ @ WbK.T
K = X @ WaKV
V = X @ WaKV

Q_rope = RoPE(X @ WaQ  @ WR)
K_rope = RoPE(X @ WaKV @ WKR)

att = softmax(Q @ K.T + Q_rope @ K_rope.T) @ V

result = att @ (WbV @ WO)
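
A sketch of the decoupled-RoPE variant, continuing from above. The rope helper below is one common formulation (half-split rotation, base 10000); the RoPE dim r and the new constant weights WR and WKR are illustrative:

def rope(x):
    # Rotate feature pairs (i, i + half) by position-dependent angles.
    n, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = np.outer(np.arange(n), freqs)      # (n, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

r = 4                              # RoPE head dim
WR  = rng.normal(size=(dq, r))
WKR = rng.normal(size=(dkv, r))

Q_rope = rope(X @ WaQ  @ WR)
K_rope = rope(X @ WaKV @ WKR)

att3 = softmax(newQ @ newK.T + Q_rope @ K_rope.T) @ newK
result3 = att3 @ (WbV @ WO)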

MLA with RoPE as MHA

The addition inside the softmax is just a concat along the reduction dimension (dot products over concatenated features sum), so:

Q = X @ WaQ @ WbQ @ WbK.T
K = X @ WaKV
V = X @ WaKV

Q_rope = RoPE(X @ WaQ  @ WR)
K_rope = RoPE(X @ WaKV @ WKR)

Qc = concat(Q, Q_rope)
Kc = concat(K, K_rope)

att = softmax(Qc @ Kc.T) @ V

result = att @ (WbV @ WO)
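
In the sketch, the concat trick checks out because dot products over a concatenated reduction dimension sum:

Qc = np.concatenate([newQ, Q_rope], axis=-1)
Kc = np.concatenate([newK, K_rope], axis=-1)

assert np.allclose(Qc @ Kc.T, newQ @ newK.T + Q_rope @ K_rope.T)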

And you can again represent MLA as MHA with decoupled RoPE
