@syntaxhacker
Created March 26, 2026 19:59
Attention Residuals (AttnRes) - arXiv:2603.15031 explained in layman's terms with ASCII diagrams

Attention Residuals (AttnRes) - The Dragon Ball Z Edition

"I used to be a standard residual connection. Then I learned to pay attention."

Paper: https://arxiv.org/abs/2603.15031
Code: https://github.com/MoonshotAI/Attention-Residuals (2.8k stars)
Authors: Kimi Team (Moonshot AI), March 2026
Model: Kimi Linear - 48B total / 3B activated parameters, trained on 1.4T tokens


The Problem (Dragon Ball Z Analogy)

How Transformers Currently Work

Imagine Goku fighting through the Snake Way:

    ╔═════════════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  THE SNAKE WAY - Every fighter adds their power to the next fighter  ║
    ║                                                               ║
    ║  Fighter 1 (Goku base)          Power Level: 9001              ║
    ║      │                                                        ║
    ║      v                                                        ║
    ║  Fighter 2 = Fighter 1 + Training  Power Level: 9001 + 2000  ║
    ║      │                    = 11001                                    ║
    ║      │                                                               ║
    ║      v                                                               ║
    ║  Fighter 3 = Fighter 2 + Training  Power Level: 11001 + 2000         ║
    ║      │                    = 13001                                    ║
    ║      │                                                               ║
    ║      v                                                               ║
    ║  ...                                                                 ║
    ║      v                                                               ║
    ║  Fighter 100                    Power Level: 9001 + 99*2000          ║
    ║                                             = 207,001                ║
    ║                                             ^^^^^^^^^^^^^            ║
    ║                                  ALL fighters are weighted           ║
    ║                                  EQUALLY. Fighter 1's                ║
    ║                                  original 9001 is now                ║
    ║                                  just ~4.3% of the total!            ║
    ║                                                                      ║
    ╚═════════════════════════════════════════════════════════════════════╝
    
    Fighter 1 (Goku's base power) is almost INVISIBLE by the end.
    
    Same thing happens in LLMs - early layer knowledge gets DILUTED.

This is exactly how every modern AI model (GPT, LLaMA, DeepSeek) works. It's called the residual connection:

    h_layer_l = h_layer_(l-1) + Transformation(h_layer_(l-1))
                  ^^^^^^^^^^^^       ^^^^^^^^^^^^^^^^
                  ALL previous       This layer's
                  power added       new contribution
                  with weight=1       with weight=1
                  
    EVERY layer gets weight = 1. NO EXCEPTIONS. NO SELECTIVITY.

Why This Is Actually Terrible

    THE DILUTION PROBLEM:
    
    Layer  1's contribution to Layer 100:  1/100 = 1%    😱
    Layer 50's contribution to Layer 100:  1/100 = 1%    😱
    Layer 99's contribution to Layer 100:  1/100 = 1%    😱
    
    Fighter 1's power is just ~4.3% of the total! Even though
    Fighter 1 learned FUNDAMENTAL stuff like "how to punch."
    
    In LLM terms:
    Layer 1 learned basic syntax (like "subjects come before verbs")
    Layer 50 learned complex reasoning patterns
    Layer 99 learned... something
    
    By the end, Layer 1's syntax knowledge is drowned out.

The PreNorm Problem (Bonus Disaster)

Most modern LLMs use Pre-LayerNorm (normalize before each layer). This makes the problem WORSE:

    WITH PRENORM, HIDDEN STATES GROW WITHOUT BOUND:
    
    Layer   1:  magnitude = 1.0x
    Layer 10:  magnitude = 3.2x
    Layer 30:  magnitude = 5.5x
    Layer 60:  magnitude = 7.8x
    Layer 100: magnitude = 10.1x    😱😱
    
    It's like every fighter in the Snake Way keeps getting BIGGER
    but the power levels aren't balanced. The later fighters
    are 10x stronger just because they've accumulated more stuff,
    not because they're actually 10x better.
    
    This is called "PreNorm dilution" and it's a KNOWN BUG
    in modern LLMs that everyone just lives with.
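
To make that growth concrete, here is a tiny numpy toy (my own illustration, not the paper's code): a pre-norm residual stack where every layer's output is added with weight 1, and the sublayer is just a random linear map standing in for attention/MLP. The hidden state's norm keeps climbing with depth, in the same spirit as the magnitudes sketched above.

    import numpy as np

    # Toy pre-norm residual stack: h_l = h_{l-1} + f_l(RMSNorm(h_{l-1})).
    # f_l is a random linear map here - enough to show the magnitude growth.
    rng = np.random.default_rng(0)
    d = 64

    def rmsnorm(x):
        return x / np.sqrt(np.mean(x**2) + 1e-6)

    h = rng.normal(size=d)                            # the embedding ("Fighter 1")
    for layer in range(1, 101):
        W = rng.normal(size=(d, d)) / np.sqrt(d)      # stand-in for attention / MLP
        h = h + W @ rmsnorm(h)                        # residual add with weight 1
        if layer in (1, 10, 30, 60, 100):
            print(layer, round(float(np.linalg.norm(h)), 1))   # norm keeps climbing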

The Solution: Attention Residuals (AttnRes)

The Idea (Dragon Ball Z Version)

What if instead of blindly adding everyone's power, each fighter could CHOOSE which previous fighters to draw power from?

    ╔═════════════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  ATTENTION RESIDUALS - Each fighter CHOOSES who to borrow from     ║
    ║                                                               ║
    ║  Fighter 100 needs to attack!                            ║
    ║                                                               ║
    ║  OLD WAY (Standard Residuals):                            ║
    ║  Fighter 100 = 100% of Fighter 1                              ║
    ║              + 100% of Fighter 2                              ║
    ║              + ...                                            ║
    ║              + 100% of Fighter 99                             ║
    ║                                                               ║
    ║  Problem: Fighter 1's technique is lost in the noise          ║
    ║                                                               ║
    ║  NEW WAY (Attention Residuals):                                ║
    ║                                                               ║
    ║  Fighter 100 thinks:                                          ║
    ║    "I need precise combat technique... I'll borrow 35%        ║
    ║     from Fighter 1 (he learned the basics!)"                  ║
    ║    "I need raw power... I'll borrow 25% from                  ║
    ║     Fighter 50"                                               ║
    ║    "I need... nah, Fighter 99 is useless for this"            ║
    ║    "I'll take 5% from Fighter 2"                              ║
    ║    "And the remaining 35% from my own training"               ║
    ║                                                               ║
    ║  Fighter 100 = 0.35*F1 + 0.25*F50 + 0.05*F2 + 0.35*F100       ║
    ║                (the weights sum to 1.0)                       ║
    ║                                                               ║
    ║  >>> Fighter 1's technique is PRESERVED and amplified!        ║
    ║  >>> Each fighter gets CUSTOM weights based on NEED!          ║
    ║                                                               ║
    ╚═══════════════════════════════════════════════════════════════╝

The Math (Still Simple, I Promise)

    STANDARD RESIDUAL:
    h_l = h_{l-1} + f_{l-1}(h_{l-1})
    
    Which unrolls to:
    h_l = h_1 + f_1(h_1) + f_2(h_2) + ... + f_{l-1}(h_{l-1})
          ^^^^  ^^^^^^^^^   ^^^^^^^^^        ^^^^^^^^^^^^^^^
          w=1     w=1          w=1                 w=1
          ALL WEIGHTS ARE 1
    
    ══════════════════════════════════════════════════════════════
    
    ATTENTION RESIDUAL:
    h_l = alpha_{0->l} * v_0 + alpha_{1->l} * v_1 + ... + alpha_{l-1->l} * v_{l-1}
          ^^^^^^^^^^^^    ^^^^      ^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^
          softmax          softmax    softmax          softmax
          attention        attention    attention        attention
          weights          weights      weights          weights
    
    Where alpha_{i->l} = "how much should layer l borrow from layer i"
    
    THE WEIGHTS ARE LEARNED AND INPUT-DEPENDENT!
    Different inputs = different borrowing patterns!
    
    ══════════════════════════════════════════════════════════════

How The Attention Weights Are Computed (The Cheapest Trick Ever)

    Each layer l has ONE learned vector: w_l (like a "preference list")
    
    To decide how much to borrow from layer i, compute:
    
    score = w_l . RMSNorm(v_i)
            ^^^    ^^^^^^^^^
            layer's     normalize the value
            preference  (prevent big layers
                        from dominating)
    
    Then softmax turns scores into probabilities (sum to 1.0)
    
    ┌─────────────────────────────────────────────────────────┐
    │                                                          │
    │  Layer 12 is deciding who to borrow from:               │
    │                                                          │
    │  w_12 = [0.3, -0.1, 0.8, ...]  (learned preference)│
    │                                                          │
    │  v_0  (basic syntax):   score = 2.1                    │
    │  v_1  (word meanings):  score = 0.3                    │
    │  v_2  (patterns):      score = -0.5                   │
    │  v_11 (context):       score = 1.2                    │
    │                                                          │
    │  After softmax:                                          │
    │  v_0:  45%  <<< "I need the basics!"               │
    │  v_1:  12%                                         │
    │  v_2:   3%  <<< "Not relevant right now"             │
    │  v_11: 25%     <<< "Some context helps"                  │
    │  v_self: 15%                                          │
    │                                                          │
    │  COST: ONE vector (w_l) per layer. That's it.         │
    │  Cheaper than buying a senzu bean on discount!          │
    │                                                          │
    └─────────────────────────────────────────────────────────┘
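
A minimal numpy sketch of that recipe (the variable names are illustrative, not the paper's code): one learned preference vector per layer, one dot product per stored value, a softmax, and a weighted sum.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64

    def rmsnorm(x):
        return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + 1e-6)

    values = rng.normal(size=(12, d))     # v_0 ... v_11: outputs of earlier layers
    w_l = rng.normal(size=d)              # this layer's learned "preference list"

    scores = rmsnorm(values) @ w_l        # score_i = w_l . RMSNorm(v_i)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: borrowing fractions sum to 1.0

    h_in = weights @ values               # the layer's input: a weighted mix over depth
    print(np.round(weights, 2))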

Why This Matters: The Formal Argument

The Time-Depth Duality (The Core Insight)

The paper's key insight is a beautiful observation:

    ╔═════════════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  SEQUENCE MODELING (RNNs)  <---->  DEPTH (Residuals)      ║
    ║                                                               ║
    ║  In sequences:                                                 ║
    ║  RNNs compressed ALL past tokens into ONE state                      ║
    ║  Then Transformers replaced this with ATTENTION                      ║
    ║  (each token can selectively access ALL previous tokens)             ║
    ║                                                                      ║
    ║  In depth:                                                           ║
    ║  Residuals compressed ALL past layers into ONE state                 ║
    ║  (h_{l-1} is a soup of everything before it)                         ║
    ║                                                                      ║
    ║  AttnRes applies the SAME fix:                                       ║
    ║  Replace compression with ATTENTION over depth!                      ║
    ║  Each layer can selectively access ALL previous layers               ║
    ║                                                                      ║
    ║  ⚡ This is literally the same idea that made Transformers            ║
    ║     dominant over RNNs, but applied to the depth dimension!          ║
    ║                                                                      ║
    ╚═════════════════════════════════════════════════════════════════════╝

Why This Works In Practice (Depth is Small)

    OBJECTION: "Attention is O(L^2) - surely you can't afford that over depth!"
    
    Actually, depth is TINY compared to sequence length!
    
    ┌────────────────────────────────────────────────────┐
    │                                                     │
    │  Sequence length:  1,000 to 1,000,000 tokens      │
    │  Model depth:      12 to 128 layers                │
    │                                                     │
    │  O(128^2) = 16,384 operations per token         │
    │  This is NOTHING compared to sequence attention!    │
    │  Even O(1000^2) would be manageable.              │
    │                                                     │
    │  >>> Attention over depth is CHEAP and FEASIBLE    │
    │                                                     │
    └────────────────────────────────────────────────────┘

The Problem at Scale: Block AttnRes

Full AttnRes Works Great But Is Too Expensive

    FULL ATTNRES (The Dream):
    
    Layer  1:  attend over [v_0]                     -- 1 value
    Layer  2:  attend over [v_0, v_1]                 -- 2 values
    Layer  3:  attend over [v_0, v_1, v_2]              -- 3 values
    ...
    Layer 99:  attend over [v_0, v_1, ..., v_98]          -- 99 values!
    
    MEMORY NEEDED per token:
    
    For 100 layers, d=4096:
    100 * 4096 = 409,600 values
    
    With batch=1, seq_len=8192:
    409,600 * 8,192 = 3.35 BILLION values
    At fp16 = 6.7 GB just for residuals! 😱
    
    Plus, in pipeline parallelism (multiple GPUs), every GPU needs
    ALL layer outputs from previous GPUs. That's MASSIVE communication.
    
    ┌─────────────────────────────────────────────────────────┐
    │  Full AttnRes: Perfect selectivity, but 6.7 GB extra    │
    │                per token. Like carrying the entire     │
    │                Snake Way roster on your back.           │
    └─────────────────────────────────────────────────────────┘

Block AttnRes: The Practical Compromise

    BLOCK ATTNRES (The Reality):
    
    Instead of attending over every single layer, GROUP layers into blocks.
    
    100 layers → 8 blocks of ~12 layers each
    
    Within a block: standard residuals (cheap, no attention)
    Between blocks: attention over block representations (selective!)
    
    ┌─────────────────────────────────────────────────────────────┐
    │                                                              │
    │  Block 0 (layers 1-12):     Standard residuals                │
    │    Summary: "Here's everything from layers 1-12"              │
    │                                                              │
    │  Block 1 (layers 13-24):    Standard residuals                │
    │    Before each sub-layer: attend over [Block0_summary]    │
    │    Summary: "Here's everything from blocks 0-1"           │
    │                                                              │
    │  Block 2 (layers 25-36):    Standard residuals                │
    │    Before each sub-layer: attend over [B0, B1_summary]    │
    │    Summary: "Here's everything from blocks 0-2"           │
    │                                                              │
    │  ...                                                          │
    │                                                              │
    │  Block 7 (layers 88-100):   Standard residuals                │
    │    Before each sub-layer: attend over [B0...B6]           │
    │    Summary: "Here's everything from blocks 0-7"           │
    │                                                              │
    │  ATTEND: Final output aggregates all 8 block summaries     │
    │                                                              │
    └─────────────────────────────────────────────────────────────┘
    
    MEMORY: 8 blocks * d = 8 * 4096 = 32,768 values
    vs Full AttnRes: 100 * 4096 = 409,600 values
    = 12.5x LESS MEMORY! 🎉
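
A quick back-of-envelope check of those numbers (fp16 = 2 bytes per value; the shapes are the ones quoted above):

    d, n_layers, n_blocks, seq_len = 4096, 100, 8, 8192

    full_per_token  = n_layers * d            # 409,600 values per token
    block_per_token = n_blocks * d            # 32,768 values per token
    print(full_per_token / block_per_token)   # 12.5x less memory

    full_gb = full_per_token * seq_len * 2 / 1e9
    print(f"full AttnRes residual cache at seq_len={seq_len}: ~{full_gb:.1f} GB")  # ~6.7 GB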

The Block Attention Mechanism (Visual)

    LAYER 36 (in Block 3) needs to compute its input:
    
    ┌─────────────────────────────────────────────────────────────┐
    │                                                              │
    │  Step 1: Stack all available block summaries                    │
    │                                                              │
    │     V = [Block0, Block1, Block2, PartialBlock3]               │
    │         │       │       │       │                        │
    │         │       │       │       v                        │
    │         │       │       │  (running sum of            │
    │         │       │       │   layers 25-35 so far)      │
    │                                                              │
    │  Step 2: Normalize all values                                  │
    │     K = RMSNorm(V)                                             │
    │                                                              │
    │  Step 3: Compute attention scores using layer 36's preference     │
    │     w_36 . K  →  [2.1, 0.3, -0.5, 1.2]                  │
    │                                                              │
    │  Step 4: Softmax (turn into percentages)                     │
    │     weights = [0.60, 0.10, 0.05, 0.25]                       │
    │                                                              │
    │  Step 5: Weighted sum                                        │
    │     output = 0.60*B0 + 0.10*B1 + 0.05*B2 + 0.25*Partial      │
    │                                                              │
    │  >>> Layer 36 borrows most from Block 0 (early features)     │
    │      because that's what this particular token needs!          │
    │                                                              │
    └─────────────────────────────────────────────────────────────┘
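
The same five steps in numpy (a sketch under my own simplifications; in reality the block summaries and the partial sum come from the forward pass, and the query vector is a trained parameter):

    import numpy as np

    rng = np.random.default_rng(1)
    d = 64

    def rmsnorm(x):
        return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + 1e-6)

    block_summaries = rng.normal(size=(3, d))   # Block0, Block1, Block2 (finished)
    partial_block   = rng.normal(size=d)        # running sum of layers 25-35 so far
    w_36 = rng.normal(size=d)                   # layer 36's learned preference vector

    V = np.vstack([block_summaries, partial_block])   # Step 1: stack available values
    K = rmsnorm(V)                                    # Step 2: normalize
    scores = K @ w_36                                 # Step 3: one score per source
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # Step 4: softmax
    layer_input = weights @ V                         # Step 5: weighted sum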

Why ~8 Blocks Is The Sweet Spot

    BLOCK SIZE SWEEP (from the paper):
    
    Blocks:    32    16     8     4     2     1
    Loss:     1.77  1.77  1.75  1.75  1.82  1.74
               ▲                             ▲
               ≈ baseline                    Full AttnRes
    
    1 block (Full AttnRes):  best loss, but too expensive
    4-8 blocks:              nearly identical performance
    16+ blocks:              degrading back toward baseline
    32 blocks:               basically baseline (standard residuals)
    
    ┌─────────────────────────────────────────────────────────────┐
    │                                                              │
    │  The Dragon Ball Z Analogy:                                  │
    │                                                              │
    │  8 blocks = 8 Dragon Ball fighters in a team                  │
    │                                                              │
    │  Instead of ALL 100 fighters adding their power to each      │
    │  fighter's attack (chaos, dilution):                         │
    │                                                              │
    │  Fighter 100 says: "Hey, let me check with the team           │
    │   captain (Block 0), then the vice-captain (Block 1),          │
    │   then..."                                              │
    │                                                              │
    │  Only needs to coordinate with 8 team captains,               │
    │  not all 99 fighters. Much more organized!                  │
    │                                                              │
    └─────────────────────────────────────────────────────────────┘

Training Infrastructure: How They Actually Made It Work

The Pipeline Problem

In real training, different GPUs handle different layers (pipeline parallelism). Block AttnRes needs block summaries from ALL previous GPUs, which is a communication nightmare.

    NAIVE APPROACH (Bad):
    
    GPU 0:  has [Block0]
    GPU 1:  needs [Block0, Block1]  → GPU 0 sends Block0
    GPU 2:  needs [Block0, Block1, Block2]  → GPU 1 sends B0, B1
    GPU 3:  needs [Block0, Block1, Block2, Block3]  → GPU 2 sends B0, B1, B2
    ...
    
    EVERY TRANSFER RE-SENDS EVERYTHING. Redundant!
    
    Communication cost: O(C^2 * N * d)  where C = pipeline chunks
    For 4 GPUs, 2 virtual stages: ~12 redundant transfers!

Cache-Based Pipeline Communication (The Fix)

    SMART APPROACH (What they actually do):
    
    GPU 0:  has [Block0]
    GPU 1:  has [Block0, Block1]  (cached Block0 locally!)
    GPU 2:  has [Block0, Block1, Block2]  (cached B0, B1!)
    
    When GPU 2 needs to send to GPU 3:
    → Only sends the NEW Block2 (incremental!)
    
    ┌─────────────────────────────────────────────────────────────┐
    │                                                              │
    │  GPU 0          GPU 1          GPU 2          GPU 3          │
    │  ┌──────┐       ┌──────┐       ┌──────┐       ┌──────┐       │
    │  │ B0   │──B0──>│ B0   │       │ B0   │       │ B0   │       │
    │  └──────┘       │ B1   │──B1──>│ B1   │       │ B1   │       │
    │                 └──────┘       │ B2   │──B2──>│ B2   │       │
    │                                └──────┘       │ B3   │       │
    │                                               └──────┘       │
    │                                                              │
    │  (each arrow carries only the block that is NEW at that      │
    │   stage; earlier blocks are already cached, never re-sent)   │
    │                                                              │
    │  Each GPU caches blocks from previous stages.                │
    │  Only NEW blocks are transmitted.                           │
    │  Peak communication drops from O(C) to O(P) per transition.    │
    │                                                              │
    │  That's a 2x improvement in communication!                  │
    │                                                              │
    └─────────────────────────────────────────────────────────────┘
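
A toy tally of the traffic (my own simplification with no virtual stages, just 4 sequential stages; it counts "block summaries sent across a GPU boundary", not Moonshot's actual pipeline code):

    # 4 pipeline stages (GPUs); assume each stage finishes 2 new blocks.
    n_stages, blocks_per_stage = 4, 2

    # Naive: every boundary re-sends everything produced so far.
    naive = sum((stage + 1) * blocks_per_stage for stage in range(n_stages - 1))

    # Cached: every boundary sends only the blocks that are new at that stage.
    cached = sum(blocks_per_stage for _ in range(n_stages - 1))

    print(naive, cached)   # 12 vs 6 summaries sent: roughly the 2x saving described above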

Two-Phase Computation (Inference Trick)

    THE TWO-PHASE TRICK FOR INFERENCE:
    
    Phase 1 (Parallel - batch all queries):
    ┌──────────────────────────────────────────────────────┐
    │                                                          │
    │  Within Block 3, all 12 layers need attention over        │
    │  [Block0, Block1, Block2].                              │
    │                                                          │
    │  OLD WAY: Read blocks 12 times (once per layer)         │
    │  NEW WAY: Read blocks ONCE, batch all 12 queries,         │
    │           compute all answers simultaneously.                    │
    │                                                          │
    │  Read cost: 12 reads → 1 read. 12x speedup! 🚀       │
    │                                                          │
    └──────────────────────────────────────────────────────┘
    
    Phase 2 (Sequential - merge with local):
    ┌──────────────────────────────────────────────────────┐
    │                                                          │
    │  Within Block 3, layer 5 needs to attend over            │
    │  the PARTIAL sum (layers 25-29 within the block).       │
    │                                                          │
    │  This must be sequential because each layer's partial       │
    │  sum changes. But it's just 1 read per layer.           │
    │                                                          │
    │  Uses "online softmax" to merge Phase 1 and Phase 2      │
    │  results exactly. No approximation!                        │
    │                                                          │
    └──────────────────────────────────────────────────────┘
    
    TOTAL INFERENCE OVERHEAD: Less than 2%! Almost free!
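
The merge relies on the standard online-softmax identity; a small numpy sketch (not the paper's kernels) shows why combining the two phases is exact:

    import numpy as np

    def partial(scores, values):
        # Attend over one group of sources; keep (max, sum of exps, weighted sum).
        m = scores.max()
        e = np.exp(scores - m)
        return m, e.sum(), e @ values

    def merge(a, b):
        # Rescale both partial results to a shared max, then add them.
        m1, l1, o1 = a
        m2, l2, o2 = b
        m = max(m1, m2)
        l = l1 * np.exp(m1 - m) + l2 * np.exp(m2 - m)
        o = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
        return m, l, o

    rng = np.random.default_rng(0)
    scores, values = rng.normal(size=7), rng.normal(size=(7, 4))

    m, l, o = merge(partial(scores[:4], values[:4]),   # Phase 1: finished blocks
                    partial(scores[4:], values[4:]))   # Phase 2: the partial block
    exact = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ values
    assert np.allclose(o / l, exact)                   # same answer, no approximation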

Results: The Dragon Ball Z Power Levels

Scaling Laws

    VALIDATION LOSS vs COMPUTE BUDGET:
    
    [plot sketch: validation loss (y-axis) vs compute budget in
     PFLOP/s-days (x-axis, ~0.5 to 50). All three curves fall as
     compute grows; Full AttnRes sits lowest, Block AttnRes just
     above it, and the Baseline highest.]
    
    Block AttnRes at compute X = Baseline at compute X * 1.25
    >>> 25% MORE compute-efficient!
    >>> Same loss, ~20% less training cost!

Downstream Benchmarks (48B Model, 1.4T Tokens)

    ╔═════════════════════════════════════════════════════════════════════╗
    ║  Benchmark              Baseline    AttnRes     Gain     Dragon Ball Z     ║
    ╠═══════════════════════════════════════════════════════════════════════════════╣
    ║                                                              ║
    ║  REASONING                                                     ║
    ║  GPQA-Diamond        36.9       44.4      +7.5    🐉🥊🥊🥊🥊🥊🥊🥊   ║
    ║  Math               53.5       57.1      +3.6    🥊🥊🥊🥊🥊🥊🥊🥊     ║
    ║  HumanEval          59.1       62.2      +3.1    🥊🥊🥊🥊🥊🥊🥊🥊     ║
    ║  MBPP               72.0       73.9      +1.9    🥊🥊🥊🥊🥊🥊🥊🥊     ║
    ║                                                              ║
    ║  GENERAL KNOWLEDGE                                             ║
    ║  BBH                76.3       78.0      +1.7    🥊🥊🥊🥊🥊🥊🥊🥊     ║
    ║  TriviaQA           69.9       71.8      +1.9    🥊🥊🥊🥊🥊🥊🥊     ║
    ║  MMLU               73.5       74.6      +1.1    🥊🥊🥊🥊🥊🥊🥊     ║
    ║  ARC-Challenge       64.6       65.7      +1.1    🥊🥊🥊🥊🥊🥊🥊     ║
    ║                                                              ║
    ║  CHINESE                                                     ║
    ║  C-Eval             79.6       82.5      +2.9    🥊🥊🥊🥊🥊🥊🥊     ║
    ║  CMMLU             82.0       82.9      +0.9    🥊🥊🥊🥊🥊🥊     ║
    ║                                                              ║
    ╚═══════════════════════════════════════════════════════════════════╝
    
    BIGGEST WINS on multi-step reasoning (+7.5 GPQA) and code (+3.1 HumanEval).
    This makes sense! When doing complex reasoning, you need to
    go back to EARLY layers for fundamental knowledge.
    Standard residuals bury that. AttnRes un-buries it.

Training Dynamics: What Actually Changes

    HIDDEN STATE MAGNITUDES (across depth):
    
    Standard Residuals:                    Block AttnRes:
    
    mag │                                         │
    10  │                          *             │
     8  │                       * *           │
     6  │                      * * *          │
     4  │                     * * * *         │
     2  │            * * * * * * * *    │
     0  │* * * * * * * * * * * *     │* * * * * * * * * * *
     └──────────────────────────────    └──────────────────────
      Layer 1              Layer 100             Layer 1              Layer 100
     
    Standard residuals:  GROWS ~10x. Later layers look 10x "stronger"
    just because they have accumulated more stuff, not because they
    are actually 10x better.
    
    Block AttnRes:  STAYS FLAT. Early layers are preserved; there is
    no built-in "later layers are stronger" effect.
    
    ┌─────────────────────────────────────────────────────────────┐
    │  It's like the difference between:                         │
    │                                                              │
    │  1) A saiyan who keeps transforming (magnitude grows        │
    │     with each form) vs                                     │
    │                                                              │
    │  2) Goku who uses Kaioken strategically                    │
    │     (powers up, borrows, powers down as needed)         │
    │                                                              │
    └─────────────────────────────────────────────────────────────┘
    GRADIENT MAGNITUDES (across depth):
    
    Standard Residuals:                    Block AttnRes:
    
    grad │ *                                        │
     8  │ *                                        │
     6  │ * *                                      │
     4  │ * * * * *                                 │ * * * * * * *
     2  │ * * * * * * * * * * * * * * * * * * * * * * * * * │
     0  │* * * * * * * * * * * * * * * * * * * * * * * * * * │* * * * * * * *
     └─────────────────────────────────────────────────────────────────┘
      Layer 1              Layer 100             Layer 1              Layer 100
     
    First layers get                       All layers get similar
    HUGE gradients (unstable!)            gradients (stable!)
    
    WHY? Because attention weights create COMPETITION.
    If one source layer is sending too many gradients,
    another one takes over. Self-regulating!
    
    ┌─────────────────────────────────────────────────────────────┐
    │  Standard:  "ALL GRADIENTS GO THROUGH EVERY LAYER!"     │
    │              (Layer 1 gets crushed by 99 layers)        │
    │                                                              │
    │  AttnRes:    "Each layer competes for gradient          │
    │              flow. Weights adjust automatically."        │
    │              (Healthy competition, like a tournament)       │
    └─────────────────────────────────────────────────────────────┘

Ablations: What They Tested And Why

Component-by-Component Breakdown

    ╔═══════════════════════════════════════════════════════════════════╗
    ║  Variant                           Loss     vs Baseline  Verdict     ║
    ╠═══════════════════════════════════════════════════════════════════╣
    ║  Baseline (PreNorm)                  1.766     --          Standard       ║
    ║  DenseFormer (static weights)        1.767     +0.001     NO GAIN        ║
    ║  mHC (multi-stream mixing)           1.747     +0.019     Better         ║
    ║  Full AttnRes                       1.737     +0.029     BEST 🔥       ║
    ║  Block AttnRes (8 blocks)           1.746     +0.020     Almost best     ║
    ║  Block AttnRes (4 blocks)           1.746     +0.020     Same as 8 blocks║
    ║                                                              ║
    ║  KEY ABLATIONS:                                                 ║
    ╠═══════════════════════════════════════════════════════════════════╣
    ║  Sigmoid instead of softmax          1.741     WORSE       Competition  ║
    ║  Multihead depth attention          1.752     WORSE       Uniform mix    ║
    ║  No RMSNorm on keys                 1.743 / 1.750  WORSE   Norm matters  ║
    ║  Input-dependent query (from h_l)     1.731     BETTER      But costs d^2  ║
    ║  Sliding window (W=8)                1.764     SLIGHT GAIN   Local only    ║
    ║                                                              ║
    ╚═══════════════════════════════════════════════════════════════════╝
    
    KEY INSIGHTS:
    
    1. Input-dependent weights are CRUCIAL (DenseFormer with static
       weights showed NO gain over baseline)
    
    2. Softmax normalization (competition) beats sigmoid
    
    3. Depth mixing should be UNIFORM across channels (multihead hurts)
    
    4. RMSNorm on keys prevents big layers from dominating
       (like normalizing power levels so everyone fights fair)
    
    5. Distant layers matter MORE than nearby ones (sliding window
       with W=8 is much worse than block attention with N=8)

The Deep Theory: Residuals as Structured Matrices

Every Residual Method Is A Matrix (Mind-Blowing Section)

    The paper shows that ALL residual variants can be viewed as a "depth mixing matrix" M.
    
    M_{i→l} = "how much does layer l borrow from layer i"
    
    ╔═══════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  STANDARD RESIDUAL (M = all-ones matrix):                     ║
    ║                                                               ║
    ║       L0  L1  L2  L3                                          ║
    ║  L0 [  1   1   1   1 ]                                        ║
    ║  L1 [  1   1   1   1 ]                                        ║
    ║  L2 [  1   1   1   1 ]    Every layer gets weight 1           ║
    ║  L3 [  1   1   1   1 ]    regardless of input.                ║
    ║                                                               ║
    ║  Like: "Every fighter adds their full power to the next"      ║
    ║  Dragon Ball: "Everyone goes Super Saiyan, no strategy"       ║
    ║                                                               ║
    ╚═══════════════════════════════════════════════════════════════╝
    
    
    ╔═══════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  ATTENTION RESIDUAL (M = dense, rank-L matrix):                ║
    ║                                                               ║
    ║       L0  L1  L2  L3                                          ║
    ║  L0 [ 0.5 0.1 0.3 0.1 ]   ← input A: "I need basics"          ║
    ║  L1 [ 0.2 0.4 0.1 0.3 ]   ← input B: "I need recent"          ║
    ║  L2 [ 0.1 0.2 0.1 0.6 ]   ← input C: "I need context"         ║
    ║  L3 [ 0.2 0.3 0.5 0.0 ]   ← input D: "I need latest"          ║
    ║                                                               ║
    ║  Each row sums to 1.0 (softmax normalization)                 ║
    ║  Each row is DIFFERENT for different inputs (input-dependent) ║
    ║  Like: "Each fighter picks their teammates strategically"     ║
    ║  Dragon Ball: "The team captain assigns roles dynamically"    ║
    ║                                                               ║
    ╚═══════════════════════════════════════════════════════════════╝
    
    
    ╔═══════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  HIGHWAY NETWORKS (M = 1-semiseparable matrix):                ║
    ║                                                               ║
    ║       L0  L1  L2  L3                                          ║
    ║  L0 [ g1  0   0   0  ]    g1 = gate for layer 1               ║
    ║  L1 [ g1  g2  0   0  ]    g2 = gate for layer 2               ║
    ║  L2 [ g1  g2  g3  0  ]    g3 = gate for layer 3               ║
    ║  L3 [ g1  g2  g3  g4 ]    gates are learned                   ║
    ║                                                               ║
    ║  Gates decide: "keep my own power or borrow from previous"    ║
    ║  Like: "Each fighter decides: use my move or borrow?"         ║
    ║  Dragon Ball: "Each fighter has a fusion move"                ║
    ║                                                               ║
    ╚═══════════════════════════════════════════════════════════════╝
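
A tiny numpy illustration of the two extremes of that matrix view (my own construction, not from the paper): standard residuals give a lower-triangular matrix of ones whose row sums grow with depth, while AttnRes rows are softmaxes that always sum to 1 and change with the input.

    import numpy as np

    L = 4
    rng = np.random.default_rng(0)

    # Standard residuals: every earlier layer contributes with weight 1.
    M_standard = np.tril(np.ones((L, L)))

    # AttnRes: each row is a softmax over earlier layers, recomputed per input.
    scores = rng.normal(size=(L, L))
    scores[np.triu_indices(L, k=1)] = -np.inf        # no borrowing from future layers
    M_attnres = np.exp(scores - scores.max(axis=1, keepdims=True))
    M_attnres /= M_attnres.sum(axis=1, keepdims=True)

    print(M_standard.sum(axis=1))   # [1. 2. 3. 4.]  - contributions pile up with depth
    print(M_attnres.sum(axis=1))    # [1. 1. 1. 1.]  - softmax competition keeps rows at 1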

The Key Theoretical Insight

    ALL existing methods (Standard, Highway, mHC) are instances of
    LINEAR ATTENTION over depth.
    
    AttnRes is SOFTMAX ATTENTION over depth.
    
    This is the SAME transition that made Transformers dominant over RNNs:
    
    RNNs (linear attention) → Transformers (softmax attention) = GAME CHANGER
    Residuals (linear attention) → AttnRes (softmax attention) = POTENTIAL GAME CHANGER
    
    The authors call this the "Sequence-Depth Duality" and it's 
    one of the most elegant observations in the paper.

Architecture Sweep: What Shape Is Optimal?

    Under fixed compute budget, what depth/width is optimal?
    
    ┌─────────────────────────────────────────────────────────────────┐
    │  BASELINE optimal:                                              │
    │                                                                 │
    │    dmodel/Lb swept:   15   30   45  [60]  75                    │
    │    optimum at dmodel/Lb = 60 (wider, shallower)                 │
    │    "Make it wide, not deep"                                     │
    │                                                                 │
    │  ATTNRES optimal:                                               │
    │                                                                 │
    │    dmodel/Lb swept:   15   30  [45]  60   75                    │
    │    optimum at dmodel/Lb = 45 (deeper, narrower!)                │
    │    "Make it deeper, AttnRes handles the depth"                  │
    │                                                                 │
    │  AttnRes shifts the optimum toward DEEPER models because        │
    │  it can exploit depth more effectively. The selective access    │
    │  to earlier layers means deep models don't lose information.    │
    └─────────────────────────────────────────────────────────────────┘

Learned Attention Patterns: What Does The Model Actually Do?

    ATTENTION WEIGHT PATTERNS (16-head model):
    
    Pre-Attention layers (before self-attention):
    ┌─────────────────────────────────────────────────────┐
    │ Source:  EMB  B0   B1   B2   B3   B4   B5             │
    │ Layer 1:  0.35 0.12 0.03 0.25 0.05 0.02             │
    │ Layer 2:  0.30 0.15 0.05 0.20 0.10 0.03             │
    │ Layer 3:  0.25 0.20 0.08 0.15 0.12 0.05             │
    │ Layer 4:  0.20  0.25 0.10 0.12 0.15 0.08             │
    │ Layer 5:  0.18  0.22 0.12 0.10  0.18 0.08             │
    │                                                              │
    │  >>> The embedding (EMB) keeps 18-35% weight throughout!       │
    │  >>> "Even deep layers remember the fundamentals"                 │
    │  >>> Like Goku always remembering his Grandpa's teachings    │
    └─────────────────────────────────────────────────────────────┘
    
    Pre-MLP layers (before feed-forward):
    ┌─────────────────────────────────────────────────────┐
    │ Source:  EMB  B0   B1   B2   B3   B4   B5             │
    │ Layer 1:  0.30  0.15 0.05  0.25 0.05 0.02             │
    │ Layer 2:  0.20  0.25  0.10  0.20  0.10 0.03             │
    │ Layer 3:  0.15  0.30  0.12  0.15  0.15 0.08             │
    │ Layer 4:  0.12  0.28  0.15  0.12  0.18 0.08             │
    │                                                              │
    │  >>> MLP layers focus more on RECENT representations       │
    │  >>> "MLP is more local, attention is more global"          │
    └─────────────────────────────────────────────────────────────┘
    
    ╔══════════════════════════════════════════════════════════════════╗
    ║                                                              ║
    ║  THREE KEY OBSERVATIONS:                                       ║
    ║                                                              ║
    ║  1. LOCALITY IS PRESERVED                                ║
    ║     Each layer prefers its immediate predecessor            ║
    ║     (diagonal dominance in the heatmaps)                    ║
    ║     But selective off-diagonal weights emerge!          ║
    ║     Like: "Mostly fight the person in front of you,          ║
    ║      but occasionally call for backup from far away"       ║
    ║                                                              ║
    ║  2. THE EMBEDDING IS NEVER FORGOTTEN                      ║
    ║     It maintains 18-35% weight in all layers              ║
    ║     Like: "Goku's saiyan training with Grandpa Gohan  ║
    ║       is ALWAYS useful, even at the end of the fight"     ║
    ║                                                              ║
    ║  3. LAYER SPECIALIZATION                                ║
    ║     Pre-attention layers maintain broader receptive fields       ║
    ║     Pre-MLP layers focus more on recent states                ║
    ║     Like: "Strategists think globally, fighters act locally" ║
    ║                                                              ║
    ╚═════════════════════════════════════════════════════════════╝

Comparison With Other Methods

    ╔══════════════════════════════════════════════════════════════════╗
    ║                                                               ║
    ║  Method          How it mixes depth     Input-dep?  Memory    Loss     ║
    ║  ───────────────────────────────── ───────────── ────────── ───────── ─────── ║
    ║  Standard       Add all equally       No         0         1.766     ║
    ║  ReZero        Add with alpha      Static     0         1.762     ║
    ║  Highway       Gate: keep/borrow     Dynamic   0         1.760     ║
    ║  LayerScale    Scale each output    Static     0         1.759     ║
    ║  DenseFormer   Learned scalars        Static     0         1.767     ║
    ║  mHC           m parallel streams    Dynamic   m*d       1.747     ║
    ║  Full AttnRes   Softmax attention    Dynamic   L*d       1.737     ║
    ║  Block AttnRes   Block softmax        Dynamic   N*d       1.746     ║
    ║                                                               ║
    ║  ⚡ Input-dependent weights are the KEY differentiator.              ║
    ║    DenseFormer adds learned weights but they're FIXED after training.  ║
    ║    AttnRes weights change for every input.                        ║
    ╚═══════════════════════════════════════════════════════════════╝

TL;DR

    ╔════════════════════════════════════════════════════════════════╗
    ║                                                              ║
    ║   🐉 PROBLEM: Standard residuals add all layers with weight=1.         ║
    ║      Early layers get DILUTED (~4.3% by layer 100).         ║
    ║      PreNorm causes magnitudes to GROW 10x with depth.              ║
    ║      Like every fighter in the Snake Way blindly                ║
    ║      adding their full power to the next.                     ║
    ║                                                              ║
    ║   ⚡ FIX: Attention Residuals (AttnRes)                             ║
    ║      Replace fixed weight=1 with SOFTMAX ATTENTION.              ║
    ║      Each layer CHOOSES which previous layers to borrow from.     ║
    ║      Like each fighter strategically choosing teammates.             ║
    ║                                                              ║
    ║   🧩 FULL ATTNRES: Perfect but expensive (O(L*d) memory)        ║
    ║      Every layer attends over ALL previous layers.                  ║
    ║      Too much memory at scale.                                  ║
    ║                                                              ║
    ║   🧩 BLOCK ATTNRES: Practical and almost as good                 ║
    ║      Group layers into ~8 blocks.                          ║
    ║      Attend over block summaries instead of individual layers.     ║
    ║      12.5x less memory. 95% of the benefit.             ║
    ║                                                              ║
    ║   📊 RESULTS:                                              ║
    ║      GPQA-Diamond:   +7.5 (multi-step reasoning!)              ║
    ║      Math:           +3.6 (mathematical reasoning)          ║
    ║      HumanEval:      +3.1 (code generation)                  ║
    ║      C-Eval:         +2.9 (Chinese understanding)           ║
    ║      ALL benchmarks improved. ZERO regressions.                  ║
    ║      Matches baseline at 1.25x less compute.                   ║
    ║                                                              ║
    ║   🎉 INFERENCE: <2% overhead. Almost free!                    ║
    ║                                                              ║
    ║   💡 KEY INSIGHTS:                                         ║
    ║      - Softmax competition > sigmoid (need competition)       ║
    ║      - RMSNorm on keys (prevent big layer dominance)     ║
    ║      - Distant layers matter more than nearby (sliding window bad)║
    ║      - Embedding never forgotten (18-35% weight always)   ║
    ║      - AttnRes favors DEEPER, narrower models               ║
    ║                                                              ║
    ║   🔑 THE CORE IDEA:                                         ║
    ║      Transformers replaced RNN recurrence with attention over       ║
    ║      SEQUENCE. AttnRes does the SAME THING for DEPTH.     ║
    ║      It's the same idea, applied to a different dimension.     ║
    ║                                                              ║
    ╚══════════════════════════════════════════════════════════════╝

Paper: arXiv:2603.15031 | Code: GitHub | March 16, 2026
