@josherich
Created July 3, 2025 17:10

A Galileo Moment for LLM Design

Introduction

This document discusses recent advances in Large Language Model (LLM) architecture design, drawing a parallel to pivotal moments in the history of science such as Galileo's Leaning Tower of Pisa experiment, which helped catalyze modern physics. Using a controlled synthetic pretraining environment, our findings reveal the true limits of LLM architectures, marking a potential turning point that may divide LLM research into “before” and “after.”

Read more: Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers (joint work with Alberto Alfarano).


Real-Life Pretraining at Scale

In experiments conducted at a repeatable academic scale (100 billion tokens), architectural differences tend to vanish into noise. Our synthetic pretraining environment has the potential to transform LLM research by revealing:

  • Clear trends, such as a 2x increase in reasoning depth
  • The early emergence of advanced skills
  • The predictive power of high-quality data for future architecture designs



Synthetic Pretraining Tasks

We have designed five synthetic pretraining tasks to isolate atomic skills (a hypothetical example of such a task is sketched after the list below). These tasks ensure:

  • True mental ("system-1") reasoning rather than reliance on explicit Chain-of-Thought (CoT)
  • Short context lengths (4,000 tokens), which reflect the actual thinking capabilities of real models
  • The exclusion of toy tasks, allowing for genuine insights into architectural limits
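
To make "reasoning depth" concrete, below is a hypothetical sketch of how a depth-controlled, no-CoT task could be generated. The function and prompt format are illustrative assumptions; the paper's five actual synthetic tasks are different and more involved.

```python
import random

def make_depth_task(depth: int) -> tuple[str, int]:
    """Hypothetical depth-controlled "mental reasoning" task: a chain of
    variable assignments whose answer requires `depth` sequential hops,
    answered directly with no chain-of-thought. Illustration only; not
    one of the paper's actual tasks."""
    names = random.sample("abcdefghijklmnopqrstuvwxyz", depth + 1)
    value = random.randint(0, 9)
    lines = [f"{names[0]} = {value}"]
    for prev, cur in zip(names, names[1:]):
        delta = random.randint(1, 3)
        lines.append(f"{cur} = {prev} + {delta}")
        value += delta
    prompt = "; ".join(lines) + f"; {names[-1]} = ?"
    return prompt, value

# Example: make_depth_task(3) might yield
# ("q = 7; d = q + 2; m = d + 1; x = m + 3; x = ?", 13)
```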



Introduction of Canon Layers

We introduce Canon layers, inspired by musical canons. These layers are lightweight horizontal residuals, as simple as averaging the three preceding tokens, and can be integrated into any model; a minimal sketch follows the list below. Their key features include:

  • Significant boosts in reasoning (2-4x depth, 30% breadth)
  • Minimal overhead and flexible integration
  • No tuning required
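
Under the description above, a minimal sketch of a Canon layer is shown below: a learned, per-channel weighted sum of a few preceding hidden states, added back as a residual. The window size, per-channel weighting, and average initialization are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Lightweight "horizontal residual": each position adds a learned,
    per-channel weighted sum of the previous `window` hidden states to
    its own state. Sketch only; window size and init are assumptions."""
    def __init__(self, dim: int, window: int = 3):
        super().__init__()
        # Initialize to a plain average over the past `window` tokens.
        self.weights = nn.Parameter(torch.full((dim, window), 1.0 / window))
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        out = x
        for i in range(1, self.window + 1):
            shifted = torch.zeros_like(x)
            shifted[:, i:, :] = x[:, :-i, :]              # hidden state from i steps back
            out = out + shifted * self.weights[:, i - 1]  # per-channel weight for offset i
        return out
```

Because the layer only adds a residual on top of existing hidden states, it can be placed in front of attention blocks, MLP blocks, or both without changing anything else in the architecture.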



Reviving NoPE with Canon Layers

Canon layers effectively revive the No Positional Embedding (NoPE) approach: they enable performance that matches or even surpasses RoPE (see the sketch after this list) while providing:

  • The best of both worlds: superior reasoning capabilities and excellent length generalization
  • Compatibility across attention mechanisms and Multi-Layer Perceptrons (MLPs), enhancing even Mixture of Experts (MoE) capacity
  • A safe, stable, and plug-and-play implementation
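
Reusing the CanonLayer sketch from the previous section, a NoPE block might look like the following: attention carries no positional encoding at all, and positional information enters only through the Canon layer's local mixing. Head count, MLP width, and norm placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoPECanonBlock(nn.Module):
    """Transformer block with no positional embeddings (NoPE); the Canon
    layer's local token mixing is the only source of position information.
    Illustrative sketch; hyperparameters are assumptions."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.canon = CanonLayer(dim)  # sketch from the previous section
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.canon(x)  # inject local/relative position signal
        seq_len = x.shape[1]
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # no RoPE, no learned positions
        x = x + attn_out
        return x + self.mlp(self.norm2(x))
```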



Enhancing Linear Attention

While linear attention is known for its speed, it has traditionally been considered weak. Canon layers address this limitation (a sketch follows the list) by:

  • Achieving performance that matches or surpasses Mamba2
  • Providing a 4x increase in reasoning depth, over 100% increase in breadth, and a 50% increase in manipulation length
  • Ensuring safety, ease of use, and efficiency with a residual-only design, requiring no activation and minimal overhead
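
As a sketch of the residual-only combination, the snippet below pairs ordinary causal linear attention (with an elu+1 feature map, an illustrative choice) with the CanonLayer residual from above. The exact linear-attention variant and placement used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Softmax-free causal attention, linear in sequence length. This is a
    memory-heavy reference form; real kernels use a recurrent scan."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature map
    kv = torch.einsum("bld,ble->blde", k, v).cumsum(dim=1)   # running sum of k_t v_t^T
    z = k.cumsum(dim=1)                                      # running sum of k_t
    num = torch.einsum("bld,blde->ble", q, kv)
    den = torch.einsum("bld,bld->bl", q, z).unsqueeze(-1) + 1e-6
    return num / den

class CanonLinearAttention(nn.Module):
    """Linear attention whose input first passes through a Canon residual,
    supplying the local token mixing that plain linear attention lacks."""
    def __init__(self, dim: int):
        super().__init__()
        self.canon = CanonLayer(dim)  # sketch from the Canon section
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.canon(x)
        return causal_linear_attention(self.q(x), self.k(x), self.v(x))
```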



Insights on Mamba's Architecture

Notably, Mamba's strength derives largely from a hidden "conv1d" layer that resembles a Canon layer in a weaker form, rather than from its State Space Model (SSM); a sketch of this conv1d appears after the list. The implications are significant:

  • Removing the hidden layer results in Mamba defaulting to linear attention.
  • Replacing it with a complete Canon layer outperforms the original Mamba.
  • Canon layers demonstrate efficacy even outside of SSMs, revealing the core elements that drive performance.
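
To make the comparison concrete, here is a sketch of the short depthwise conv1d found inside a Mamba block (kernel size 4, as in standard Mamba); the surrounding details are illustrative. Unlike the residual-only Canon layer above, this conv sits in the main path and is followed by an activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaShortConv(nn.Module):
    """Depthwise causal conv1d of the kind used inside Mamba blocks
    (kernel size 4). Applied in-path with an activation, unlike the
    residual-only Canon layer sketched earlier."""
    def __init__(self, dim: int, kernel: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, padding=kernel - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); convolve over the sequence dimension
        y = self.conv(x.transpose(1, 2))[..., : x.shape[1]]  # drop right padding => causal
        return F.silu(y.transpose(1, 2))

# In these terms, the finding above reads: drop this conv and the block behaves
# like plain linear attention; swap it for a full Canon residual and it
# outperforms the original Mamba design.
```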



Conclusion

The advancements discussed herein signify a transformative era in LLM architecture design. By leveraging synthetic pretraining and innovative components like Canon layers, we are poised to deepen our understanding and enhance the capabilities of language models.

Generated by tweet-to-markdown
