AttnRes replaces the standard residual connection in transformers with a depth attention mechanism — instead of simply adding each layer's output to a running sum, the model attends over previous layer outputs to decide what information to carry forward.
Standard transformers update the residual stream with `x = x + layer(x)` at every layer. AttnRes variants replace this addition with a learned attention operation across the depth axis: "which previous layers' outputs should I attend to when constructing the input to this layer?"
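A minimal sketch of the idea, assuming one learned query vector per layer that scores each previous layer's output per token (the exact parameterization in AttnRes may differ — this is an illustrative NumPy version, not the reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_attention_residual(prev_outputs, query):
    """Combine previous layer outputs via attention over the depth axis.

    prev_outputs: list of k arrays, each (seq_len, d_model),
                  the outputs of layers 0..k-1.
    query: (d_model,) learned per-layer query vector
           (hypothetical parameterization, for illustration).
    Returns (seq_len, d_model): the attended input to the current layer.
    """
    H = np.stack(prev_outputs)                 # (k, seq_len, d_model)
    scores = H @ query / np.sqrt(len(query))   # (k, seq_len) per-token depth scores
    weights = softmax(scores, axis=0)          # normalize over the depth axis
    # Per-token convex combination of previous layer outputs.
    return np.einsum('ks,ksd->sd', weights, H)
```

Each layer's input then becomes `depth_attention_residual(all_prev_outputs, q_k)` instead of the running sum, so the model can down-weight early layers or skip over uninformative ones.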
All experiments use a GPT-2-style decoder-only transformer trained on 10B tokens of FineWeb-Edu, with RoPE positional embeddings, SwiGLU MLPs, and RMSNorm.
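For concreteness, the setup above might be captured in a config like the following; the field names are illustrative, not taken from an actual codebase:

```python
# Hypothetical experiment configuration matching the described setup.
config = {
    "arch": "gpt2-decoder-only",
    "positional_encoding": "rope",
    "mlp": "swiglu",
    "norm": "rmsnorm",
    "residual": "attnres",          # vs. "standard" for the x + layer(x) baseline
    "dataset": "fineweb-edu",
    "train_tokens": 10_000_000_000,  # 10B tokens
}
```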