Seven Papers Converging on the Same Theory of Transformer Internals

Seven independent research programs, drawing on biology, statistical physics, neuroscience, cognitive science, pure mathematics, and dynamical systems, all converge on the same structural description of how transformers compute. Different formalisms, same underlying object.

Full isomorphism analysis in the thread. Here are the sources.

1. On the Biology of a Large Language Model

Lindsey, Gurnee, Ameisen, Chen, Pearce, Turner, Citro et al. — Anthropic (2025)

Circuit tracing through Claude 3.5 Haiku using attribution graphs built on cross-layer transcoders. Case studies on multi-step reasoning, planning in poems, multilingual circuits, addition, medical diagnosis, hallucinations, refusals, jailbreaks, chain-of-thought faithfulness, and hidden goals in misaligned models.

🔗 transformer-circuits.pub/2025/attribution-graphs/biology.html

2. Physics of Language Models

Zeyuan Allen-Zhu (Meta FAIR), ICML 2024 Tutorial

Uses controlled synthetic experiments to search for universal laws that govern LLMs. Breaks "intelligence" into dimensions (structure, reasoning, knowledge, architecture) and trains hundreds of small models under varied conditions to find laws that hold regardless of scale.

🔗 physics.allen-zhu.com

3. The Neuroscience of Transformers

Peter König & Mario Negrello (2026)

Maps the transformer encoder/decoder architecture onto the laminar structure of the cortical column. Values → L4 feedforward pathway, Keys → tangential L2/3 stream, Queries → L1 contextual feedback, attention weights → dendritic coincidence detection. Generates a full table of testable predictions.

🔗 arxiv.org/abs/2603.15339
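For readers who want the objects in that mapping made concrete, here is a minimal sketch of standard scaled dot-product attention for a single query, in plain Python. It shows only the generic transformer operation that produces the attention weights from queries and keys and applies them to values; it is not taken from the paper, and the function names `softmax` and `attention` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    query:  list of d floats
    keys:   list of n key vectors, each of length d
    values: list of n value vectors (any common length)
    Returns (weights, output).
    """
    d = len(query)
    # Query-key similarity, scaled by sqrt(d) as in the standard transformer.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the attention-weighted average of the value vectors.
    dim = len(values[0])
    output = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(dim)
    ]
    return weights, output
```

In the paper's proposed correspondence, `weights` is the quantity mapped to dendritic coincidence detection, while `values` carries the feedforward content.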

4. A Statistical Physics of Language Model Reasoning

Jack David Carson & Amir Reisizadeh — MIT (2025)

Models sentence-level hidden-state trajectories as a stochastic dynamical system (Itô SDE) with latent regime switching. Finds four reasoning regimes on a rank-40 manifold capturing 50% of the variance. The fitted switching linear dynamical system (SLDS) achieves R² ≈ 0.68 on held-out trajectories. Validated on adversarial belief manipulation across 8 models and 7 benchmarks.

🔗 arxiv.org/abs/2506.04374
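To give a feel for what "an SDE with latent regime switching" means, here is a toy sketch: a one-dimensional mean-reverting process whose attractor changes when a hidden two-state Markov chain switches, integrated with the Euler-Maruyama scheme. Everything here is an assumption for illustration (one dimension, two regimes, and all parameter values); the paper works with four regimes in a rank-40 space and fits its model to real hidden states.

```python
import math
import random


def simulate_regime_switching(n_steps=200, dt=0.1, sigma=0.3,
                              p_switch=0.05, seed=0):
    """Euler-Maruyama simulation of dx = -k_z (x - mu_z) dt + sigma dW.

    z is a latent regime in {0, 1} that flips with probability p_switch
    at each step. All parameter values are illustrative, not fitted.
    """
    rng = random.Random(seed)
    mu = [-1.0, 1.0]   # attractor centre of each regime
    k = [1.0, 1.0]     # mean-reversion rate of each regime
    x, z = 0.0, 0
    xs, zs = [], []
    for _ in range(n_steps):
        # Latent regime switch (discrete Markov chain).
        if rng.random() < p_switch:
            z = 1 - z
        # Drift pulls the state toward the current regime's attractor.
        drift = -k[z] * (x - mu[z])
        # Euler-Maruyama step: deterministic drift plus Gaussian noise.
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        xs.append(x)
        zs.append(z)
    return xs, zs
```

Setting `sigma=0.0` and `p_switch=0.0` removes both sources of randomness, so the trajectory simply relaxes toward the first regime's attractor, which is a quick sanity check on the drift term.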

5. Large Language Models and Cognitive Science: A Comprehensive Review

Qian Niu, Junyu Liu, Ziqian Bi et al. (2024, updated 2025)

Survey of LLM-cognition comparisons across language processing, reasoning, memory, sensory judgment, and cognitive biases. Covers evaluation methods, LLMs as cognitive models, integration with cognitive architectures, and limitations.

🔗 arxiv.org/abs/2409.02387

6. Entropy, Thermodynamics and the Geometrization of the Language Model

Wenzhe Yang (2024)

Pure mathematics framework for language models using set theory, analysis, and statistical mechanics. Defines the moduli space of distributions, entropy function, partition function, internal energy, Helmholtz free energy, and the Boltzmann manifold. Argues zero-entropy points explain why LLMs need billions of parameters. Formulates an AGI conjecture.

🔗 arxiv.org/abs/2407.21092
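The thermodynamic quantities named above can be computed for any finite distribution. The sketch below uses the textbook statistical-mechanics definitions (Boltzmann weights, partition function Z, entropy S, internal energy U, and Helmholtz free energy F = -T ln Z), treating each outcome's energy as an assumed input; it is not a reproduction of the paper's own constructions, and `boltzmann_summary` is a hypothetical helper name.

```python
import math


def boltzmann_summary(energies, temperature=1.0):
    """Textbook thermodynamic quantities for a finite set of energy levels.

    p_i = exp(-E_i / T) / Z,  Z = sum_i exp(-E_i / T)
    S = -sum_i p_i ln p_i,    U = sum_i p_i E_i,    F = -T ln Z
    """
    T = temperature
    # Shift by the minimum energy for numerical stability.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    z_shifted = sum(weights)
    probs = [w / z_shifted for w in weights]
    # Undo the shift: ln Z = ln(z_shifted) - e_min / T.
    log_z = math.log(z_shifted) - e_min / T
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    internal_energy = sum(p * e for p, e in zip(probs, energies))
    free_energy = -T * log_z
    return {
        "probs": probs,
        "log_z": log_z,
        "entropy": entropy,
        "internal_energy": internal_energy,
        "free_energy": free_energy,
    }
```

Two properties are worth checking by hand: the identity F = U - T S holds for any input, and equal energies give the uniform distribution, which has the maximum entropy ln n for n outcomes.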

7. Transformer Dynamics: A Neuroscientific Approach to Interpretability

Jesseba Fernando & Grigori Guitchounts — Northeastern / Independent (2025)

Treats the residual stream of Llama 3.1 8B as a dynamical system evolving across layers. Finds accelerating velocity, decreasing mutual information, unstable periodic orbits (~10.74 rotations per unit), and attractor-like self-correcting dynamics in lower layers. Proposes a "neuroscience of AI" combining dynamical systems theory with mechanistic interpretability.

🔗 arxiv.org/abs/2502.12131
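As a rough illustration of treating the residual stream as a trajectory across layers, the sketch below computes a layer-to-layer "velocity" for one token. It assumes velocity means the Euclidean norm of the change between consecutive layers, which may differ from the paper's exact definition, and it expects the hidden states as plain lists of floats (how they are extracted from a model is left out). Both function names are hypothetical.

```python
import math


def layer_velocities(hidden_states):
    """Norm of the change in the residual stream between adjacent layers.

    hidden_states: list of per-layer activation vectors for one token,
    ordered from the first layer to the last.
    """
    return [
        math.dist(prev, curr)
        for prev, curr in zip(hidden_states, hidden_states[1:])
    ]


def is_accelerating(velocities):
    """True if every layer-to-layer step is larger than the one before."""
    return all(b > a for a, b in zip(velocities, velocities[1:]))
```

For example, `layer_velocities([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])` yields steps of growing size, the toy analogue of the accelerating velocity reported above.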

Key Isomorphisms Across Papers

| Connection | Papers | What converges |
| --- | --- | --- |
| Boltzmann duality | Yang ↔ Anthropic | Entropy function and attribution graphs are formally dual operations on the same distribution |
| Orthogonal dynamical axes | Carson ↔ Fernando | Same multi-basin structure sliced across reasoning steps vs. across layers |
| Cortical prediction matches empirical finding | König ↔ Anthropic | Predicted laminar flow matches discovered input → abstract → output feature flow |
| Planning as metastability | Anthropic ↔ Carson | Multiple candidate features held simultaneously = metastable basin before commitment |
| Default circuits as max entropy | Anthropic ↔ Yang | "Can't answer" defaults = maximum entropy; known-entity features reduce entropy |

Thread by @sakeeb.rahman

