
@0xBigBoss
Created October 8, 2024 22:22
Attention and Differential Attention functions.
import math
import torch.nn.functional as F

def Attention(X, W_q, W_k, W_v):
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    # Q, K, V: [b, n, d]
    d = Q.shape[-1]
    s = 1 / math.sqrt(d)
    A = Q @ K.transpose(-1, -2) * s
    return F.softmax(A, dim=-1) @ V

def DiffAttn(X, W_q, W_k, W_v, λ):
    # W_q, W_k, W_v project to 2d; queries and keys are split into two halves of size d
    Q1, Q2 = (X @ W_q).chunk(2, dim=-1)
    K1, K2 = (X @ W_k).chunk(2, dim=-1)
    V = X @ W_v
    # Qi, Ki: [b, n, d]; V: [b, n, 2d]
    d = Q1.shape[-1]
    s = 1 / math.sqrt(d)
    A1 = Q1 @ K1.transpose(-1, -2) * s
    A2 = Q2 @ K2.transpose(-1, -2) * s
    # differential attention: difference of two softmax attention maps, weighted by λ
    return (F.softmax(A1, dim=-1) - λ * F.softmax(A2, dim=-1)) @ V
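A minimal usage sketch, assuming PyTorch with randomly initialized projection weights. The batch size, sequence length, model dimension d_model, head dimension d, and the λ value below are illustrative choices, not part of the gist:

import torch

b, n, d_model, d = 2, 16, 64, 32          # batch, sequence length, model dim, head dim (illustrative)
X = torch.randn(b, n, d_model)

# Standard attention: W_q, W_k, W_v map d_model -> d
W_q, W_k, W_v = (torch.randn(d_model, d) for _ in range(3))
out = Attention(X, W_q, W_k, W_v)         # [b, n, d]

# Differential attention: W_q, W_k, W_v map d_model -> 2d, λ is a scalar weight
W_q2, W_k2, W_v2 = (torch.randn(d_model, 2 * d) for _ in range(3))
out_diff = DiffAttn(X, W_q2, W_k2, W_v2, 0.8)   # [b, n, 2d]

Larger λ subtracts more of the second attention map from the first, so the output depends more on where the two maps disagree; λ = 0 reduces DiffAttn to ordinary softmax attention over the 2d-dimensional values.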