@ := Matrix Multiplication
Example: A @ B = matmul(A, B)
.T := Transpose
Example: A.T = transpose(A)
W prefix := Constant Matrix (Weights for Model)
Wa @ Wb is also constant, so it can be precomputed as a single matrix
X := Input to the attention layer
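The notation maps directly onto numpy; a minimal check (shapes here are arbitrary, chosen just for illustration):

import numpy as np

A = np.ones((2, 3))
B = np.ones((3, 4))
assert np.allclose(A @ B, np.matmul(A, B))   # @ is matrix multiplication
assert A.T.shape == (3, 2)                   # .T is transpose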
Standard attention (MHA), shown for a single head:
Q = X @ WQ
K = X @ WK
V = X @ WV
att = softmax(Q @ K.T) @ V
result = att @ WO
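For concreteness, here is that block as a single-head numpy sketch. The sizes (and the single head) are my simplifications, and the usual 1/sqrt(d) scaling is omitted to match the text:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 16, 8
X  = rng.normal(size=(seq, d_model))
WQ = rng.normal(size=(d_model, d_head))
WK = rng.normal(size=(d_model, d_head))
WV = rng.normal(size=(d_model, d_head))
WO = rng.normal(size=(d_head, d_model))

Q = X @ WQ
K = X @ WK
V = X @ WV
att = softmax(Q @ K.T) @ V   # (seq, d_head)
result = att @ WO            # (seq, d_model)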
MLA factors each projection into a down-projection (Wa) and an up-projection (Wb) through a low-rank latent, with K and V sharing the down-projection WaKV:
Q = X @ WaQ @ WbQ
K = X @ WaKV @ WbK
V = X @ WaKV @ WbV
att = softmax(Q @ K.T) @ V
result = att @ WO
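The same sketch with MLA's factored projections; d_latent is an assumed compression width, much smaller than d_model in practice. At inference only the latent X @ WaKV needs to be cached per token:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, d_latent, d_head = 5, 16, 4, 8
X    = rng.normal(size=(seq, d_model))
WaQ  = rng.normal(size=(d_model, d_latent))   # query down-projection
WbQ  = rng.normal(size=(d_latent, d_head))    # query up-projection
WaKV = rng.normal(size=(d_model, d_latent))   # shared key/value down-projection
WbK  = rng.normal(size=(d_latent, d_head))    # key up-projection
WbV  = rng.normal(size=(d_latent, d_head))    # value up-projection
WO   = rng.normal(size=(d_head, d_model))

Q = X @ WaQ @ WbQ
K = X @ WaKV @ WbK
V = X @ WaKV @ WbV
att = softmax(Q @ K.T) @ V
result = att @ WO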
Optimization 1: fold WbV into the output projection, so V can simply be the latent X @ WaKV:
Q = X @ WaQ @ WbQ
K = X @ WaKV @ WbK
V = X @ WaKV
att = softmax(Q @ K.T) @ V
result = att @ (WbV @ WO)
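Optimization 1 is pure associativity: softmax(Q @ K.T) @ (latent @ WbV) @ WO = (softmax(Q @ K.T) @ latent) @ (WbV @ WO). A quick numeric check, reusing the assumed shapes from the sketch above:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, d_latent, d_head = 5, 16, 4, 8
X    = rng.normal(size=(seq, d_model))
WaQ  = rng.normal(size=(d_model, d_latent))
WbQ  = rng.normal(size=(d_latent, d_head))
WaKV = rng.normal(size=(d_model, d_latent))
WbK  = rng.normal(size=(d_latent, d_head))
WbV  = rng.normal(size=(d_latent, d_head))
WO   = rng.normal(size=(d_head, d_model))

latent = X @ WaKV
scores = softmax((X @ WaQ @ WbQ) @ (latent @ WbK).T)

result_mla      = (scores @ (latent @ WbV)) @ WO    # original MLA form
result_absorbed = (scores @ latent) @ (WbV @ WO)    # WbV folded into WO
assert np.allclose(result_mla, result_absorbed)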
Optimization 2: fold WbK into the query side. Expand the attention scores:
Q @ K.T = (X @ WaQ @ WbQ) @ (X @ WaKV @ WbK).T
Q @ K.T = (X @ WaQ @ WbQ) @ (WbK.T @ WaKV.T @ X.T)
Q @ K.T = (X @ WaQ @ WbQ @ WbK.T) @ (WaKV.T @ X.T)
Q @ K.T = (X @ WaQ @ WbQ @ WbK.T) @ (X @ WaKV).T
newQ = X @ WaQ @ WbQ @ WbK.T
newK = X @ WaKV
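The same kind of check for the score rewrite; note the transpose on WbK that the absorbed query picks up:

import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_latent, d_head = 5, 16, 4, 8
X    = rng.normal(size=(seq, d_model))
WaQ  = rng.normal(size=(d_model, d_latent))
WbQ  = rng.normal(size=(d_latent, d_head))
WaKV = rng.normal(size=(d_model, d_latent))
WbK  = rng.normal(size=(d_latent, d_head))

scores_orig     = (X @ WaQ @ WbQ) @ (X @ WaKV @ WbK).T      # MLA's Q @ K.T
scores_absorbed = (X @ WaQ @ WbQ @ WbK.T) @ (X @ WaKV).T    # newQ @ newK.T
assert np.allclose(scores_orig, scores_absorbed)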
So MLA can be run as:
Q = X @ WaQ @ WbQ @ WbK.T
K = X @ WaKV
V = X @ WaKV
att = softmax(Q @ K.T) @ V
result = att @ (WbV @ WO)
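Putting both folds together, the absorbed form should reproduce the original MLA output exactly (up to floating-point error), while only ever touching the latent X @ WaKV. Shapes are the same assumed ones as above:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, d_latent, d_head = 5, 16, 4, 8
X    = rng.normal(size=(seq, d_model))
WaQ  = rng.normal(size=(d_model, d_latent))
WbQ  = rng.normal(size=(d_latent, d_head))
WaKV = rng.normal(size=(d_model, d_latent))
WbK  = rng.normal(size=(d_latent, d_head))
WbV  = rng.normal(size=(d_latent, d_head))
WO   = rng.normal(size=(d_head, d_model))

# original MLA
out_mla = softmax((X @ WaQ @ WbQ) @ (X @ WaKV @ WbK).T) @ (X @ WaKV @ WbV) @ WO

# absorbed form: the KV cache only needs the latent
latent = X @ WaKV
Q = X @ WaQ @ WbQ @ WbK.T
out_absorbed = softmax(Q @ latent.T) @ latent @ (WbV @ WO)

assert np.allclose(out_mla, out_absorbed)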
RoPE can completely block Optimization 2, which would render the latent compression useless: the rotation depends on the token position, so it cannot be folded into the constant weight products. So, instead of applying RoPE inside MLA's compressed path, add a small set of extra decoupled query/key dimensions that carry the RoPE computation:
Q = X @ WaQ @ WbQ @ WbK.T
K = X @ WaKV
V = X @ WaKV
Q_rope = RoPE(X @ WaQ @ WR)
K_rope = RoPE(X @ WaKV @ WKR)
att = softmax( Q @ K.T + Q_rope @ K_rope.T ) @ V
result = att @ (WbV @ WO)
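A numpy sketch of that decoupled score. The rope() helper below is a generic rotate-half RoPE and d_rope is an assumed width for the extra dimensions; both are illustrative choices, not the exact DeepSeek parametrization:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rope(x, base=10000.0):
    # generic rotate-half RoPE over an even feature dimension
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), freqs)          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
seq, d_model, d_latent, d_head, d_rope = 5, 16, 4, 8, 6
X    = rng.normal(size=(seq, d_model))
WaQ  = rng.normal(size=(d_model, d_latent))
WbQ  = rng.normal(size=(d_latent, d_head))
WaKV = rng.normal(size=(d_model, d_latent))
WbK  = rng.normal(size=(d_latent, d_head))
WbV  = rng.normal(size=(d_latent, d_head))
WO   = rng.normal(size=(d_head, d_model))
WR   = rng.normal(size=(d_latent, d_rope))   # decoupled RoPE projections
WKR  = rng.normal(size=(d_latent, d_rope))

latent = X @ WaKV                    # K and V in the absorbed form
Q      = X @ WaQ @ WbQ @ WbK.T
Q_rope = rope(X @ WaQ @ WR)
K_rope = rope(latent @ WKR)

att = softmax(Q @ latent.T + Q_rope @ K_rope.T) @ latent
result = att @ (WbV @ WO)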
Adding the two score terms is the same as concatenating Q with Q_rope and K with K_rope along the reduction (feature) dimension:
Q = X @ WaQ @ WbQ @ WbK.T
K = X @ WaKV
V = X @ WaKV
Q_rope = RoPE(X @ WaQ @ WR)
K_rope = RoPE(X @ WaKV @ WKR)
Qc = concat(Q, Q_rope)
Kc = concat(K, K_rope)
att = softmax(Qc @ Kc.T) @ V
result = att @ (WbV @ WO)
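The concat identity itself is easy to check numerically: splitting a matrix product along the inner (reduction) dimension turns it into a sum, so concatenating the two query parts and the two key parts reproduces the added scores. Shapes below are again arbitrary:

import numpy as np

rng = np.random.default_rng(0)
seq, d_a, d_b = 5, 8, 6
Q,      K      = rng.normal(size=(seq, d_a)), rng.normal(size=(seq, d_a))
Q_rope, K_rope = rng.normal(size=(seq, d_b)), rng.normal(size=(seq, d_b))

Qc = np.concatenate([Q, Q_rope], axis=-1)
Kc = np.concatenate([K, K_rope], axis=-1)

assert np.allclose(Qc @ Kc.T, Q @ K.T + Q_rope @ K_rope.T)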
And so you can again represent MLA as plain MHA, now with decoupled RoPE.