"I used to be a standard residual connection. Then I learned to pay attention."
Paper: https://arxiv.org/abs/2603.15031
Code: https://github.com/MoonshotAI/Attention-Residuals (2.8k stars)
Authors: Kimi Team (Moonshot AI), March 2026
Model: Kimi Linear - 48B total / 3B activated, trained on 1.4T tokens
Imagine Goku fighting through the Snake Way:
╔═════════════════════════════════════════════════════════════════════╗
║ ║
║ THE SNAKE WAY - Every fighter adds their power to the next fighter ║
║ ║
║ Fighter 1 (Goku base) Power Level: 9001 ║
║ │ ║
║ v ║
║  Fighter 2 = Fighter 1 + Training    Power Level: 9001 + 2000        ║
║       │                                           = 11001            ║
║       │                                                              ║
║       v                                                              ║
║  Fighter 3 = Fighter 2 + Training    Power Level: 11001 + 2000       ║
║       │                                           = 13001            ║
║       │                                                              ║
║       v                                                              ║
║      ...                                                             ║
║       v                                                              ║
║  Fighter 100                         Power Level: 207,001            ║
║                                      = 9001 + 99*2000                ║
║                                        ^^^^^^^^^^^^^^^^              ║
║                                        ALL fighters are weighted     ║
║                                        EQUALLY. Fighter 1's          ║
║                                        original 9001 is now          ║
║                                        just 4.3% of the total!       ║
║                                                                      ║
╚═════════════════════════════════════════════════════════════════════╝
Fighter 1 (Goku's base power) is almost INVISIBLE by the end.
Same thing happens in LLMs - early layer knowledge gets DILUTED.
This is exactly how every modern AI model (GPT, LLaMA, DeepSeek) works. It's called the residual connection:
h_layer_l = h_layer_(l-1) + Transformation(h_layer_(l-1))
^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^
ALL previous This layer's
power added new contribution
with weight=1 with weight=1
EVERY layer gets weight = 1. NO EXCEPTIONS. NO SELECTIVITY.
THE DILUTION PROBLEM:
Layer 1's contribution to Layer 100: 1/100 = 1% 😱
Layer 50's contribution to Layer 100: 1/100 = 1% 😱
Layer 99's contribution to Layer 100: 1/100 = 1% 😱
Fighter 1's power is just 4.3% of the total! Even though
Fighter 1 learned FUNDAMENTAL stuff like "how to punch."
In LLM terms:
Layer 1 learned basic syntax (like "subjects come before verbs")
Layer 50 learned complex reasoning patterns
Layer 99 learned... something
By the end, Layer 1's syntax knowledge is drowned out.
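A toy NumPy simulation (mine, not from the paper) makes the dilution concrete: add 99 equal-scale random updates to a residual stream with weight 1 each, and measure how much of "layer 1" survives in the final stream:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 4096, 100

layer1 = rng.standard_normal(d)     # stand-in for layer 1's features (e.g. syntax)
h = layer1.copy()
for _ in range(n_layers - 1):
    h = h + rng.standard_normal(d)  # each later layer adds a same-scale update, weight 1

# How much of the final stream still points in layer 1's direction?
share = (layer1 @ h) / (np.linalg.norm(layer1) * np.linalg.norm(h))
print(f"cosine(layer 1, final stream) = {share:.2f}")  # ~0.1 for 100 layers
```

With L equal-scale contributions, the overlap shrinks like 1/sqrt(L): layer 1 is still in there, but drowned out.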
Most modern LLMs use Pre-LayerNorm (normalize before each layer). This makes the problem WORSE:
WITH PRENORM, HIDDEN STATES GROW WITHOUT BOUND:
Layer 1: magnitude = 1.0x
Layer 10: magnitude = 3.2x
Layer 30: magnitude = 5.5x
Layer 60: magnitude = 7.8x
Layer 100: magnitude = 10.1x 😱😱
It's like every fighter in the Snake Way keeps getting BIGGER
but the power levels aren't balanced. The later fighters
are 10x stronger just because they've accumulated more stuff,
not because they're actually 10x better.
This is called "PreNorm dilution" and it's a KNOWN BUG
in modern LLMs that everyone just lives with.
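A toy random-walk model (my sketch, not the paper's measurement) shows why: PreNorm feeds each sublayer a normalized input, so sublayer outputs stay roughly unit-scale no matter how big the residual stream already is, and the stream keeps growing:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

h = rng.standard_normal(d)
h /= np.linalg.norm(h)                # start at magnitude 1.0
norms = [1.0]
for _ in range(99):
    update = rng.standard_normal(d)
    update /= np.linalg.norm(update)  # PreNorm stand-in: unit-scale sublayer output
    h = h + update                    # residual add with weight 1
    norms.append(float(np.linalg.norm(h)))

print(f"||h|| at layer 1:   {norms[0]:.1f}")
print(f"||h|| at layer 100: {norms[-1]:.1f}")  # ~sqrt(100) = 10
```

Near-orthogonal unit steps in high dimension give ||h|| ≈ sqrt(L), which matches the ~10x growth figure above.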
What if instead of blindly adding everyone's power, each fighter could CHOOSE which previous fighters to draw power from?
╔═════════════════════════════════════════════════════════════════════╗
║                                                                     ║
║  ATTENTION RESIDUALS - Each fighter CHOOSES who to borrow from      ║
║                                                                     ║
║  Fighter 100 needs to attack!                                       ║
║                                                                     ║
║  OLD WAY (Standard Residuals):                                      ║
║     Fighter 100 = 100% of Fighter 1                                 ║
║                 + 100% of Fighter 2                                 ║
║                 + ...                                               ║
║                 + 100% of Fighter 99                                ║
║                                                                     ║
║     Problem: Fighter 1's technique is lost in the noise             ║
║                                                                     ║
║  NEW WAY (Attention Residuals):                                     ║
║                                                                     ║
║     Fighter 100 thinks:                                             ║
║       "I need precise combat technique... I'll borrow 35%           ║
║        from Fighter 1 (he learned the basics!)"                     ║
║       "I need raw power... I'll borrow 25% from Fighter 50"         ║
║       "I need... nah, Fighter 99 is useless for this"               ║
║       "I'll take 5% from Fighter 2"                                 ║
║       "And the remaining 35% from my own training"                  ║
║                                                                     ║
║     Fighter 100 = 0.35*F1 + 0.25*F50 + 0.05*F2 + 0.35*F100          ║
║                                                                     ║
║  >>> Fighter 1's technique is PRESERVED and amplified!              ║
║  >>> Each fighter gets CUSTOM weights based on NEED!                ║
╚═════════════════════════════════════════════════════════════════════╝
STANDARD RESIDUAL:
h_l = h_{l-1} + f_{l-1}(h_{l-1})
Which unrolls to:
h_l = h_1 + f_1(h_1) + f_2(h_2) + ... + f_{l-1}(h_{l-1})
^^^^ ^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^^^^^^^
w=1 w=1 w=1 w=1
ALL WEIGHTS ARE 1
══════════════════════════════════════════════════════════════
ATTENTION RESIDUAL:
h_l = alpha_{0->l} * v_0 + alpha_{1->l} * v_1 + ... + alpha_{l-1->l} * v_{l-1}
^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^
softmax softmax softmax softmax
attention attention attention attention
weights weights weights weights
Where alpha_{i->l} = "how much should layer l borrow from layer i"
THE WEIGHTS ARE LEARNED AND INPUT-DEPENDENT!
Different inputs = different borrowing patterns!
══════════════════════════════════════════════════════════════
Each layer l has ONE learned vector: w_l (like a "preference list")
To decide how much to borrow from layer i, compute:
score = w_l . RMSNorm(v_i)
^^^ ^^^^^^^^^
layer's normalize the value
preference (prevent big layers
from dominating)
Then softmax turns scores into probabilities (sum to 1.0)
┌─────────────────────────────────────────────────────────┐
│ │
│ Layer 12 is deciding who to borrow from: │
│ │
│ w_12 = [0.3, -0.1, 0.8, ...] (learned preference)│
│ │
│ v_0 (basic syntax): score = 2.1 │
│ v_1 (word meanings): score = 0.3 │
│ v_2 (patterns): score = -0.5 │
│ v_11 (context): score = 1.2 │
│ │
│ After softmax: │
│ v_0: 45% <<< "I need the basics!" │
│ v_1: 12% │
│ v_2: 3% <<< "Not relevant right now" │
│ v_11: 25% <<< "Some context helps" │
│ v_self: 15% │
│ │
│ COST: ONE vector (w_l) per layer. That's it. │
│ Cheaper than buying a senzu bean on discount! │
│ │
└─────────────────────────────────────────────────────────┘
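The whole mechanism above fits in a few lines. A minimal NumPy sketch (function and variable names are mine; the paper's actual implementation will differ):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Normalize each value so large-magnitude layers can't dominate the scores."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def depth_attention(values, w_l):
    """Mix earlier layer outputs using one learned preference vector w_l.

    values: (num_prev, d) stack of contributions v_0 .. v_{l-1}
    w_l:    (d,) layer l's learned query (its "preference list")
    """
    scores = rms_norm(values) @ w_l      # one dot product per previous layer
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax: weights compete, sum to 1.0
    return alpha @ values, alpha

rng = np.random.default_rng(0)
v = rng.standard_normal((12, 64))   # 12 earlier contributions, d = 64
w_12 = rng.standard_normal(64)      # the ONLY new parameter for this layer
mixed, alpha = depth_attention(v, w_12)
print(alpha.round(3))               # input-dependent borrowing percentages
```

Note the cost really is one d-dimensional parameter vector per layer plus l dot products; the weights change whenever the values change.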
The paper's key insight is a beautiful observation:
╔═════════════════════════════════════════════════════════════════════╗
║                                                                     ║
║  SEQUENCE MODELING (RNNs)    <---->    DEPTH (Residuals)            ║
║                                                                     ║
║  In sequences:                                                      ║
║     RNNs compressed ALL past tokens into ONE state                  ║
║     Then Transformers replaced this with ATTENTION                  ║
║     (each token can selectively access ALL previous tokens)         ║
║                                                                     ║
║  In depth:                                                          ║
║     Residuals compress ALL past layers into ONE state               ║
║     (h_{l-1} is a soup of everything before it)                     ║
║                                                                     ║
║  AttnRes applies the SAME fix:                                      ║
║     Replace compression with ATTENTION over depth!                  ║
║     Each layer can selectively access ALL previous layers           ║
║                                                                     ║
║  ⚡ This is literally the same idea that made Transformers           ║
║     dominant over RNNs, but applied to the depth dimension!         ║
║                                                                     ║
╚═════════════════════════════════════════════════════════════════════╝
OBJECTION: "Attention is O(L^2) - you can't afford that over depth!"
Actually, depth is TINY compared to sequence length:
┌────────────────────────────────────────────────────┐
│ │
│ Sequence length: 1,000 to 1,000,000 tokens │
│ Model depth: 12 to 128 layers │
│ │
│ O(128^2) = 16,384 operations per token │
│ This is NOTHING compared to sequence attention! │
│ Even O(1000^2) would be manageable. │
│ │
│ >>> Attention over depth is CHEAP and FEASIBLE │
│ │
└────────────────────────────────────────────────────┘
FULL ATTENRES (The Dream):
Layer 1: attend over [v_0] -- 1 value
Layer 2: attend over [v_0, v_1] -- 2 values
Layer 3: attend over [v_0, v_1, v_2] -- 3 values
...
Layer 99: attend over [v_0, v_1, ..., v_98] -- 99 values!
MEMORY NEEDED per token:
For 100 layers, d=4096:
100 * 4096 = 409,600 values
With batch=1, seq_len=8192:
409,600 * 8,192 = 3.35 BILLION values
At fp16 = 6.7 GB just for residuals! 😱
Plus, in pipeline parallelism (multiple GPUs), every GPU needs
ALL layer outputs from previous GPUs. That's MASSIVE communication.
┌─────────────────────────────────────────────────────────┐
│ Full AttnRes: Perfect selectivity, but 6.7 GB extra │
│ per token. Like carrying the entire │
│ Snake Way roster on your back. │
└─────────────────────────────────────────────────────────┘
BLOCK ATTENRES (The Reality):
Instead of attending over every single layer, GROUP layers into blocks.
100 layers → 8 blocks of ~12 layers each
Within a block: standard residuals (cheap, no attention)
Between blocks: attention over block representations (selective!)
┌─────────────────────────────────────────────────────────────┐
│ │
│ Block 0 (layers 1-12): Standard residuals │
│ Summary: "Here's everything from layers 1-12" │
│ │
│ Block 1 (layers 13-24): Standard residuals │
│ Before each sub-layer: attend over [Block0_summary] │
│ Summary: "Here's everything from blocks 0-1" │
│ │
│ Block 2 (layers 25-36): Standard residuals │
│ Before each sub-layer: attend over [B0, B1_summary] │
│ Summary: "Here's everything from blocks 0-2" │
│ │
│ ... │
│ │
│  Block 7 (layers 85-100): Standard residuals                │
│ Before each sub-layer: attend over [B0...B6] │
│ Summary: "Here's everything from blocks 0-7" │
│ │
│  AT THE END: final output aggregates all 8 block summaries  │
│ │
└─────────────────────────────────────────────────────────────┘
MEMORY: 8 blocks * d = 8 * 4096 = 32,768 values
vs Full AttnRes: 100 * 4096 = 409,600 values
= 12.5x LESS MEMORY! 🎉
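The memory claims above are easy to sanity-check with back-of-envelope arithmetic (fp16 = 2 bytes per value):

```python
# Residual-cache memory: Full AttnRes keeps every layer output per token,
# Block AttnRes keeps only one summary per block.
d, n_layers, n_blocks, seq_len = 4096, 100, 8, 8192

full_per_token  = n_layers * d            # 409,600 values
block_per_token = n_blocks * d            # 32,768 values

full_gb = full_per_token * seq_len * 2 / 1e9   # batch=1, fp16 (2 bytes/value)
print(f"Full AttnRes:  {full_per_token:,} values/token, {full_gb:.1f} GB at seq 8192")
print(f"Block AttnRes: {block_per_token:,} values/token")
print(f"Memory saving: {full_per_token / block_per_token:.1f}x")
```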
LAYER 36 (in Block 3, counting the embedding as Block 0) needs to compute its input:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Step 1: Stack all available block summaries │
│ │
│ V = [Block0, Block1, Block2, PartialBlock3] │
│ │ │ │ │ │
│ │ │ │ v │
│ │ │ │ (running sum of │
│ │ │ │ layers 25-35 so far) │
│ │
│ Step 2: Normalize all values │
│ K = RMSNorm(V) │
│ │
│ Step 3: Compute attention scores using layer 36's preference │
│ w_36 . K → [2.1, 0.3, -0.5, 1.2] │
│ │
│  Step 4: Softmax (turn into percentages)                    │
│          weights = [0.61, 0.10, 0.04, 0.25]                 │
│                                                             │
│  Step 5: Weighted sum                                       │
│          output = 0.61*B0 + 0.10*B1 + 0.04*B2 + 0.25*Partial│
│                                                             │
│  >>> Layer 36 borrows most from Block 0 (early features)    │
│      because that's what this particular token needs!       │
│ │
└─────────────────────────────────────────────────────────────┘
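Putting the two levels together, here is a toy end-to-end sketch (all names and the random "sublayer" are mine; real sublayers are attention/MLP blocks): plain residual adds inside a block, softmax attention over frozen block summaries between blocks.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mix(summaries, partial, w):
    """Attend over finished block summaries plus this block's running sum."""
    V = np.vstack(summaries + [partial])
    s = rms_norm(V) @ w
    a = np.exp(s - s.max())
    a /= a.sum()
    return a @ V

rng = np.random.default_rng(0)
d, block_size, n_layers = 64, 4, 12

summaries, partial = [], np.zeros(d)
for layer in range(n_layers):
    w = rng.standard_normal(d)                        # learned per-layer query (random here)
    h_in = mix(summaries, partial, w) if summaries else partial
    update = rms_norm(h_in) + rng.standard_normal(d)  # stand-in for a sublayer output
    partial = partial + update                        # standard residual INSIDE the block
    if (layer + 1) % block_size == 0:                 # block finished: freeze its summary
        summaries.append(partial)
        partial = np.zeros(d)

print(f"kept {len(summaries)} block summaries instead of {n_layers} layer outputs")
```

The state carried forward is just the list of block summaries plus one running sum, which is exactly where the memory saving comes from.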
BLOCK SIZE SWEEP (from the paper):
Block size:   32     16      8      4      2      1
Loss:        1.77   1.77   1.75   1.75   1.75   1.74
              ▲                                  ▲
          ≈ BASELINE                      Full AttnRes
Size 1 (Full AttnRes): best loss, but too expensive
Sizes 2-8: nearly identical performance!
Size 16+: degrading back toward baseline
Size 32: basically baseline (standard residuals)
┌─────────────────────────────────────────────────────────────┐
│ │
│ The Dragon Ball Z Analogy: │
│ │
│ 8 blocks = 8 Dragon Ball fighters in a team │
│ │
│ Instead of ALL 100 fighters adding their power to each │
│ fighter's attack (chaos, dilution): │
│ │
│ Fighter 100 says: "Hey, let me check with the team │
│ captain (Block 0), then the vice-captain (Block 1), │
│ then..." │
│ │
│ Only needs to coordinate with 8 team captains, │
│ not all 99 fighters. Much more organized! │
│ │
└─────────────────────────────────────────────────────────────┘
In real training, different GPUs handle different layers (pipeline parallelism). Block AttnRes needs block summaries from ALL previous GPUs, which is a communication nightmare.
NAIVE APPROACH (Bad):
GPU 0: has [Block0]
GPU 1: needs [Block0, Block1] → GPU 0 sends Block0
GPU 2: needs [Block0, Block1, Block2] → GPU 1 sends B0, B1
GPU 3: needs [Block0, Block1, Block2, Block3] → GPU 2 sends B0, B1, B2
...
EVERY TRANSFER RE-SENDS EVERYTHING. Redundant!
Communication cost: O(C^2 * N * d) where C = pipeline chunks
For 4 GPUs, 2 virtual stages: ~12 redundant transfers!
SMART APPROACH (What they actually do):
GPU 0: has [Block0]
GPU 1: has [Block0, Block1] (cached Block0 locally!)
GPU 2: has [Block0, Block1, Block2] (cached B0, B1!)
When GPU 2 needs to send to GPU 3:
→ Only sends the NEW Block2 (incremental!)
┌─────────────────────────────────────────────────────────────┐
│ │
│   GPU 0        GPU 1        GPU 2        GPU 3              │
│  ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐            │
│  │ B0   │ ──> │ B0   │     │ B0   │     │ B0   │            │
│  └──────┘     │ B1   │ ──> │ B1   │     │ B1   │            │
│               └──────┘     │ B2   │ ──> │ B2   │            │
│                            └──────┘     │ B3   │            │
│                                         └──────┘            │
│                                                             │
│  (each arrow carries ONLY the newest block; earlier         │
│   blocks are already cached on the receiving GPU)           │
│ │
│ Each GPU caches blocks from previous stages. │
│ Only NEW blocks are transmitted. │
│ Peak communication drops from O(C) to O(P) per transition. │
│ │
│ That's a 2x improvement in communication! │
│ │
└─────────────────────────────────────────────────────────────┘
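A toy count (assuming one new block per pipeline stage and a single pass; the paper's schedule with virtual stages is more involved) shows the asymptotic difference:

```python
# 8 pipeline stages, one block summary produced per stage.
n_stages = 8

# Naive: every stage boundary re-sends ALL blocks produced so far.
naive_per_boundary = [stage + 1 for stage in range(n_stages - 1)]

# Incremental: downstream GPUs cache what they already received,
# so each boundary carries only the one NEW block.
incremental_per_boundary = [1] * (n_stages - 1)

print("naive:      ", naive_per_boundary, "total:", sum(naive_per_boundary))
print("incremental:", incremental_per_boundary, "total:", sum(incremental_per_boundary))
# naive traffic grows quadratically with pipeline depth; incremental grows linearly
```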
THE TWO-PHASE TRICK FOR INFERENCE:
Phase 1 (Parallel - batch all queries):
┌──────────────────────────────────────────────────────┐
│ │
│ Within Block 3, all 12 layers need attention over │
│ [Block0, Block1, Block2]. │
│ │
│ OLD WAY: Read blocks 12 times (once per layer) │
│ NEW WAY: Read blocks ONCE, batch all 12 queries, │
│ compute all answers simultaneously. │
│ │
│ Read cost: 12 reads → 1 read. 12x speedup! 🚀 │
│ │
└──────────────────────────────────────────────────────┘
Phase 2 (Sequential - merge with local):
┌──────────────────────────────────────────────────────┐
│ │
│ Within Block 3, layer 5 needs to attend over │
│ the PARTIAL sum (layers 25-29 within the block). │
│ │
│ This must be sequential because each layer's partial │
│ sum changes. But it's just 1 read per layer. │
│ │
│ Uses "online softmax" to merge Phase 1 and Phase 2 │
│ results exactly. No approximation! │
│ │
└──────────────────────────────────────────────────────┘
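The "online softmax" merge is the same trick FlashAttention uses: keep a (max, normalizer, weighted sum) summary per phase and combine the summaries exactly. A small self-contained check (my sketch, not the paper's kernel):

```python
import numpy as np

def partial_softmax(scores, values):
    """Summarize one chunk as (running max, exp-sum, unnormalized output)."""
    m = scores.max()
    e = np.exp(scores - m)
    return m, e.sum(), e @ values

def merge(a, b):
    """Exactly combine two chunk summaries into one (online softmax)."""
    (m1, s1, o1), (m2, s2, o2) = a, b
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)   # rescale each chunk to the shared max
    return m, c1 * s1 + c2 * s2, c1 * o1 + c2 * o2

rng = np.random.default_rng(0)
scores, values = rng.standard_normal(6), rng.standard_normal((6, 4))

# Phase 1: block summaries (first 4 entries). Phase 2: in-block partials (last 2).
m, s, o = merge(partial_softmax(scores[:4], values[:4]),
                partial_softmax(scores[4:], values[4:]))
merged = o / s

# Reference: one softmax over everything. The merge is exact, no approximation.
w = np.exp(scores - scores.max()); w /= w.sum()
print(np.allclose(merged, w @ values))  # True
```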
TOTAL INFERENCE OVERHEAD: Less than 2%! Almost free!
VALIDATION LOSS vs COMPUTE BUDGET (0.5 to 50 PFLOP/s-days):
[Plot: loss curves vs compute. Full AttnRes and Block AttnRes
sit below the Baseline curve across the whole compute range,
with Block AttnRes tracking Full AttnRes closely.]
Block AttnRes at compute X matches Baseline at compute 1.25*X
>>> 25% MORE compute-efficient!
>>> Same loss, 20% less training compute!
╔═════════════════════════════════════════════════════════════════════╗
║ Benchmark Baseline AttnRes Gain Dragon Ball Z ║
╠════════════════════════════════════════════════════════════════════╣
║ ║
║ REASONING ║
║ GPQA-Diamond 36.9 44.4 +7.5 🐉🥊🥊🥊🥊🥊🥊🥊 ║
║ Math 53.5 57.1 +3.6 🥊🥊🥊🥊🥊🥊🥊🥊 ║
║ HumanEval 59.1 62.2 +3.1 🥊🥊🥊🥊🥊🥊🥊🥊 ║
║ MBPP 72.0 73.9 +1.9 🥊🥊🥊🥊🥊🥊🥊🥊 ║
║ ║
║ GENERAL KNOWLEDGE ║
║ BBH 76.3 78.0 +1.7 🥊🥊🥊🥊🥊🥊🥊🥊 ║
║ TriviaQA 69.9 71.8 +1.9 🥊🥊🥊🥊🥊🥊🥊 ║
║ MMLU 73.5 74.6 +1.1 🥊🥊🥊🥊🥊🥊🥊 ║
║ ARC-Challenge 64.6 65.7 +1.1 🥊🥊🥊🥊🥊🥊🥊 ║
║ ║
║ CHINESE ║
║ C-Eval 79.6 82.5 +2.9 🥊🥊🥊🥊🥊🥊🥊 ║
║ CMMLU 82.0 82.9 +0.9 🥊🥊🥊🥊🥊🥊 ║
║ ║
╚═══════════════════════════════════════════════════════════════════╝
BIGGEST WINS on multi-step reasoning (+7.5 GPQA) and code (+3.1 HumanEval).
This makes sense! When doing complex reasoning, you need to
go back to EARLY layers for fundamental knowledge.
Standard residuals bury that. AttnRes un-buries it.
HIDDEN STATE MAGNITUDES (across depth):

Standard Residuals: magnitude climbs ~10x from layer 1 to layer 100.
Later layers are 10x "stronger" just because they accumulated
more stuff, not because they're actually 10x better.

Block AttnRes: magnitude STAYS FLAT across depth.
Early layers are preserved; no "later layers are stronger" effect.
┌─────────────────────────────────────────────────────────────┐
│ It's like the difference between: │
│ │
│ 1) A saiyan who keeps transforming (magnitude grows │
│ with each form) vs │
│ │
│ 2) Goku who uses Kaioken strategically │
│ (powers up, borrows, powers down as needed) │
│ │
└─────────────────────────────────────────────────────────────┘
GRADIENT MAGNITUDES (across depth):

Standard Residuals: the first layers receive HUGE gradients
(unstable!), with gradient norm spiking toward layer 1.

Block AttnRes: all layers receive similar gradients (stable!).
WHY? Because attention weights create COMPETITION.
If one source layer is sending too many gradients,
another one takes over. Self-regulating!
┌─────────────────────────────────────────────────────────────┐
│ Standard: "ALL GRADIENTS GO THROUGH EVERY LAYER!" │
│ (Layer 1 gets crushed by 99 layers) │
│ │
│ AttnRes: "Each layer competes for gradient │
│ flow. Weights adjust automatically." │
│ (Healthy competition, like a tournament) │
└─────────────────────────────────────────────────────────────┘
╔═══════════════════════════════════════════════════════════════════╗
║ Variant Loss vs Baseline Verdict ║
╠══════════════════════════════════════════════════════════════════╣
║ Baseline (PreNorm) 1.766 -- Standard ║
║ DenseFormer (static weights) 1.767 +0.001 NO GAIN ║
║ mHC (multi-stream mixing) 1.747 +0.019 Better ║
║ Full AttnRes 1.737 +0.029 BEST 🔥 ║
║ Block AttnRes (8 blocks) 1.746 +0.020 Almost best ║
║ Block AttnRes (4 blocks) 1.746 +0.020 Same as 8 blocks║
║ ║
║ KEY ABLATIONS: ║
╠══════════════════════════════════════════════════════════════════╣
║ Sigmoid instead of softmax 1.741 WORSE Competition ║
║ Multihead depth attention 1.752 WORSE Uniform mix ║
║ No RMSNorm on keys               1.743/1.750  WORSE     Norm matters   ║
║ Input-dependent query (from h_l) 1.731 BETTER But costs d^2 ║
║ Sliding window (W=8) 1.764 SLIGHT GAIN Local only ║
║ ║
╚═══════════════════════════════════════════════════════════════════╝
KEY INSIGHTS:
1. Input-dependent weights are CRUCIAL (DenseFormer with static
weights showed NO gain over baseline)
2. Softmax normalization (competition) beats sigmoid
3. Depth mixing should be UNIFORM across channels (multihead hurts)
4. RMSNorm on keys prevents big layers from dominating
(like normalizing power levels so everyone fights fair)
5. Distant layers matter MORE than nearby ones (sliding window
with W=8 is much worse than block attention with N=8)
The paper shows that ALL residual variants can be viewed as a "depth mixing matrix" M.
M_{i→l} = "how much does layer l borrow from layer i"
╔══════════════════════════════════════════════════════════════════╗
║                                                                  ║
║  STANDARD RESIDUAL (M = lower-triangular all-ones matrix):       ║
║                                                                  ║
║         L0   L1   L2   L3                                        ║
║    L0 [  1    0    0    0 ]                                      ║
║    L1 [  1    1    0    0 ]                                      ║
║    L2 [  1    1    1    0 ]   Every earlier layer gets weight 1  ║
║    L3 [  1    1    1    1 ]   regardless of input.               ║
║                                                                  ║
║  Like: "Every fighter adds their full power to the next"         ║
║  Dragon Ball: "Everyone goes Super Saiyan, no strategy"          ║
╚══════════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════════╗
║                                                                  ║
║  ATTENTION RESIDUAL (M = row-wise softmax, input-dependent):     ║
║                                                                  ║
║          L0    L1    L2    L3                                    ║
║    L0 [ 1.00    0     0     0  ]                                 ║
║    L1 [ 0.60  0.40    0     0  ]                                 ║
║    L2 [ 0.50  0.20  0.30    0  ]                                 ║
║    L3 [ 0.35  0.05  0.25  0.35 ]                                 ║
║                                                                  ║
║  Each row sums to 1.0 (softmax normalization)                    ║
║  Each row is DIFFERENT for different inputs (input-dependent)    ║
║  Like: "Each fighter picks their teammates strategically"        ║
║  Dragon Ball: "The team captain assigns roles dynamically"       ║
╚══════════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════════╗
║                                                                  ║
║  HIGHWAY NETWORKS (M = 1-semiseparable matrix):                  ║
║                                                                  ║
║         L0   L1   L2   L3                                        ║
║    L0 [ g1    0    0    0 ]   g1 = gate for layer 1              ║
║    L1 [ g1   g2    0    0 ]   g2 = gate for layer 2              ║
║    L2 [ g1   g2   g3    0 ]   g3 = gate for layer 3              ║
║    L3 [ g1   g2   g3   g4 ]   gates are learned                  ║
║                                                                  ║
║  Gates decide: "keep my own power or borrow from previous"       ║
║  Like: "Each fighter decides: use my move or borrow?"            ║
║  Dragon Ball: "Each fighter has a fusion move"                   ║
╚══════════════════════════════════════════════════════════════════╝
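These matrix views are easy to instantiate (toy code, mine; random scores stand in for the learned queries over real activations):

```python
import numpy as np

L = 4

# Standard residual: lower-triangular all-ones, identical for every input.
M_standard = np.tril(np.ones((L, L)))

# Attention residual: causal row-wise softmax, different for every input.
rng = np.random.default_rng(0)
scores = rng.standard_normal((L, L))           # stand-in for w_l . RMSNorm(v_i)
scores[np.triu_indices(L, k=1)] = -np.inf      # a layer can't borrow from the future
M_attn = np.exp(scores - scores.max(axis=1, keepdims=True))
M_attn /= M_attn.sum(axis=1, keepdims=True)

print(M_standard)
print(M_attn.round(2))
print(M_attn.sum(axis=1))   # every row sums to 1.0
```

Rerunning with different `scores` gives a different M_attn: that per-input variability is exactly what the static methods lack.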
ALL existing methods (Standard, Highway, mHC) are instances of
LINEAR ATTENTION over depth.
AttnRes is SOFTMAX ATTENTION over depth.
This is the SAME transition that made Transformers dominant over RNNs:
RNNs (linear attention) → Transformers (softmax attention) = GAME CHANGER
Residuals (linear attention) → AttnRes (softmax attention) = POTENTIAL GAME CHANGER
The authors call this the "Sequence-Depth Duality" and it's
one of the most elegant observations in the paper.
Under fixed compute budget, what depth/width is optimal?
┌─────────────────────────────────────────────────────────────────┐
│  BASELINE optimal:                                              │
│                                                                 │
│     dmodel/Lb = 15                                              │
│     dmodel/Lb = 30                                              │
│     dmodel/Lb = 45                                              │
│     dmodel/Lb = 60   ← OPTIMAL (wider, shallower)               │
│     dmodel/Lb = 75                                              │
│                                                                 │
│     "Make it wide, not deep"                                    │
│                                                                 │
│  ATTNRES optimal:                                               │
│                                                                 │
│     dmodel/Lb = 15                                              │
│     dmodel/Lb = 30                                              │
│     dmodel/Lb = 45   ← OPTIMAL (deeper, narrower!)              │
│     dmodel/Lb = 60                                              │
│     dmodel/Lb = 75                                              │
│                                                                 │
│     "Make it deeper, AttnRes handles the depth"                 │
│                                                                 │
│  AttnRes shifts the optimum toward DEEPER models because        │
│  it can exploit depth more effectively. The selective access to │
│  earlier layers means deep models don't lose information.       │
└─────────────────────────────────────────────────────────────────┘
ATTENTION WEIGHT PATTERNS (16-head model):
Pre-Attention layers (before self-attention):
┌─────────────────────────────────────────────────────────────┐
│  Source:   EMB    B0     B1     B2     B3     B4            │
│  Layer 1:  0.35   0.12   0.03   0.25   0.05   0.02          │
│  Layer 2:  0.30   0.15   0.05   0.20   0.10   0.03          │
│  Layer 3:  0.25   0.20   0.08   0.15   0.12   0.05          │
│  Layer 4:  0.20   0.25   0.10   0.12   0.15   0.08          │
│  Layer 5:  0.18   0.22   0.12   0.10   0.18   0.08          │
│                                                             │
│  >>> The embedding (EMB) keeps 18-35% weight throughout!    │
│  >>> "Even deep layers remember the fundamentals"           │
│  >>> Like Goku always remembering his Grandpa's teachings   │
└─────────────────────────────────────────────────────────────┘
Pre-MLP layers (before feed-forward):
┌─────────────────────────────────────────────────────────────┐
│  Source:   EMB    B0     B1     B2     B3     B4            │
│  Layer 1:  0.30   0.15   0.05   0.25   0.05   0.02          │
│  Layer 2:  0.20   0.25   0.10   0.20   0.10   0.03          │
│  Layer 3:  0.15   0.30   0.12   0.15   0.15   0.08          │
│  Layer 4:  0.12   0.28   0.15   0.12   0.18   0.08          │
│                                                             │
│  >>> MLP layers focus more on RECENT representations        │
│  >>> "MLP is more local, attention is more global"          │
└─────────────────────────────────────────────────────────────┘
╔═════════════════════════════════════════════════════════════════╗
║ ║
║ THREE KEY OBSERVATIONS: ║
║ ║
║ 1. LOCALITY IS PRESERVED ║
║ Each layer prefers its immediate predecessor ║
║ (diagonal dominance in the heatmaps) ║
║ But selective off-diagonal weights emerge! ║
║ Like: "Mostly fight the person in front of you, ║
║ but occasionally call for backup from far away" ║
║ ║
║ 2. THE EMBEDDING IS NEVER FORGOTTEN ║
║ It maintains 18-35% weight in all layers ║
║ Like: "Goku's saiyan training with Grandpa Gohan ║
║ is ALWAYS useful, even at the end of the fight" ║
║ ║
║ 3. LAYER SPECIALIZATION ║
║ Pre-attention layers maintain broader receptive fields ║
║ Pre-MLP layers focus more on recent states ║
║ Like: "Strategists think globally, fighters act locally" ║
║ ║
╚═════════════════════════════════════════════════════════════╝
╔═════════════════════════════════════════════════════════════════╗
║ ║
║ Method How it mixes depth Input-dep? Memory Loss ║
║ ───────────────────────────────── ───────────── ────────── ───────── ─────── ║
║ Standard Add all equally No 0 1.766 ║
║ ReZero Add with alpha Static 0 1.762 ║
║ Highway Gate: keep/borrow Dynamic 0 1.760 ║
║ LayerScale Scale each output Static 0 1.759 ║
║ DenseFormer Learned scalars Static 0 1.767 ║
║ mHC m parallel streams Dynamic m*d 1.747 ║
║ Full AttnRes Softmax attention Dynamic L*d 1.737 ║
║ Block AttnRes Block softmax Dynamic N*d 1.746 ║
║ ║
║ ⚡ Input-dependent weights are the KEY differentiator. ║
║ DenseFormer adds learned weights but they're FIXED after training. ║
║ AttnRes weights change for every input. ║
╚═══════════════════════════════════════════════════════════════╝
╔════════════════════════════════════════════════════════════════╗
║ ║
║ 🐉 PROBLEM: Standard residuals add all layers with weight=1. ║
║     Early layers get DILUTED (4.3% by layer 100).              ║
║ PreNorm causes magnitudes to GROW 10x with depth. ║
║ Like every fighter in the Snake Way blindly ║
║ adding their full power to the next. ║
║ ║
║ ⚡ FIX: Attention Residuals (AttnRes) ║
║ Replace fixed weight=1 with SOFTMAX ATTENTION. ║
║ Each layer CHOOSES which previous layers to borrow from. ║
║ Like each fighter strategically choosing teammates. ║
║ ║
║ 🧩 FULL ATTENRES: Perfect but expensive (O(L*d) memory) ║
║ Every layer attends over ALL previous layers. ║
║ Too much memory at scale. ║
║ ║
║ 🧩 BLOCK ATTENRES: Practical and almost as good ║
║ Group layers into ~8 blocks. ║
║ Attend over block summaries instead of individual layers. ║
║ 12.5x less memory. 95% of the benefit. ║
║ ║
║ 📊 RESULTS: ║
║ GPQA-Diamond: +7.5 (multi-step reasoning!) ║
║ Math: +3.6 (mathematical reasoning) ║
║ HumanEval: +3.1 (code generation) ║
║ C-Eval: +2.9 (Chinese understanding) ║
║ ALL benchmarks improved. ZERO regressions. ║
║     Same loss as baseline with ~20% less training compute.     ║
║ ║
║ 🎉 INFERENCE: <2% overhead. Almost free! ║
║ ║
║ 💡 KEY INSIGHTS: ║
║ - Softmax competition > sigmoid (need competition) ║
║ - RMSNorm on keys (prevent big layer dominance) ║
║ - Distant layers matter more than nearby (sliding window bad)║
║ - Embedding never forgotten (18-35% weight always) ║
║ - AttnRes favors DEEPER, narrower models ║
║ ║
║ 🔑 THE CORE IDEA: ║
║ Transformers replaced RNN recurrence with attention over ║
║ SEQUENCE. AttnRes does the SAME THING for DEPTH. ║
║ It's the same idea, applied to a different dimension. ║
║ ║
╚══════════════════════════════════════════════════════════════╝
Paper: arXiv:2603.15031 | Code: GitHub | March 16, 2026