@Sakeeb91
Last active March 23, 2026 10:52

Seven Papers Converging on the Same Theory of Transformer Internals

Seven independent research programs (biology, statistical physics, neuroscience, cognitive science, pure mathematics, dynamical systems) all converge on the same structural description of how transformers compute. Different formalisms. Same underlying object.

Full isomorphism analysis in the thread. Here are the sources.


1. On the Biology of a Large Language Model

Lindsey, Gurnee, Ameisen, Chen, Pearce, Turner, Citro et al., Anthropic (2025)

Circuit tracing through Claude 3.5 Haiku using attribution graphs built on cross-layer transcoders. Case studies on multi-step reasoning, planning in poems, multilingual circuits, addition, medical diagnosis, hallucinations, refusals, jailbreaks, chain-of-thought faithfulness, and hidden goals in misaligned models.

🔗 transformer-circuits.pub/2025/attribution-graphs/biology.html


2. Physics of Language Models

Zeyuan Allen-Zhu, Meta FAIR / ICML 2024 Tutorial

Controlled synthetic experiments to discover universal laws governing all LLMs. Breaks "intelligence" into dimensions (structure, reasoning, knowledge, architecture) and trains hundreds of small models under varied conditions to find laws that hold regardless of scale.

🔗 physics.allen-zhu.com


3. The Neuroscience of Transformers

Peter König & Mario Negrello (2026)

Maps the transformer encoder/decoder architecture onto the laminar structure of the cortical column. Values → L4 feedforward pathway, Keys → tangential L2/3 stream, Queries → L1 contextual feedback, attention weights → dendritic coincidence detection. Generates a full table of testable predictions.

🔗 arxiv.org/abs/2603.15339
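
For reference, the query, key, value, and attention-weight roles named in the mapping above are those of standard scaled dot-product attention. A minimal Python sketch, assuming toy inputs: the function names and numbers are illustrative, and the cortical labels in the comments only repeat the mapping as summarized here. The code does not test that mapping.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:  # queries: mapped to L1 contextual feedback
        # keys: mapped to the tangential L2/3 stream
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        # attention weights: mapped to dendritic coincidence detection
        w = softmax(scores)
        # values: mapped to the L4 feedforward pathway
        out = [sum(wj * v[i] for wj, v in zip(w, values))
               for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: one query that matches the first key more closely.
outputs, weights = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights[0], outputs[0])
```

The weights sum to one, and the output leans toward the value whose key best matches the query.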


4. A Statistical Physics of Language Model Reasoning

Jack David Carson & Amir Reisizadeh, MIT (2025)

Models sentence-level hidden-state trajectories as a stochastic dynamical system (Itô SDE) with latent regime switching. Finds four reasoning regimes on a rank-40 manifold capturing 50% variance. SLDS achieves R² ≈ 0.68 on held-out trajectories. Validated on adversarial belief manipulation across 8 models and 7 benchmarks.

🔗 arxiv.org/abs/2506.04374
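
To make "SDE with latent regime switching" concrete, here is a minimal one-dimensional sketch using Euler-Maruyama integration and a sticky Markov regime. Everything in it is an assumption for illustration: the scalar state, the attractor centres, the pull strengths, and the switching probability are hypothetical and are not the paper's fitted model.

```python
import math
import random

def simulate_switching_sde(n_steps=200, n_regimes=4, dt=0.05,
                           stay_prob=0.95, sigma=0.1, seed=0):
    """Euler-Maruyama for dx = k_z * (c_z - x) dt + sigma dW,
    where the regime z follows a sticky Markov chain."""
    rng = random.Random(seed)
    # Hypothetical per-regime attractor centres and pull strengths.
    centres = [rng.gauss(0.0, 1.0) for _ in range(n_regimes)]
    pulls = [rng.uniform(0.5, 2.0) for _ in range(n_regimes)]

    x, z = 0.0, 0
    xs, zs = [x], [z]
    for _ in range(n_steps - 1):
        # Keep the current regime with probability stay_prob; otherwise
        # jump uniformly to one of the other regimes.
        if rng.random() > stay_prob:
            z = rng.choice([r for r in range(n_regimes) if r != z])
        drift = pulls[z] * (centres[z] - x)              # pull toward the centre
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Brownian increment
        x = x + drift * dt + noise
        xs.append(x)
        zs.append(z)
    return xs, zs

states, regimes = simulate_switching_sde()
print(len(states), sorted(set(regimes)))
```

Within a regime the state relaxes toward that regime's centre; a switch changes the target, which gives the multi-basin picture the summary describes.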


5. Large Language Models and Cognitive Science: A Comprehensive Review

Qian Niu, Junyu Liu, Ziqian Bi et al. (2024, updated 2025)

Survey of LLM-cognition comparisons across language processing, reasoning, memory, sensory judgment, and cognitive biases. Covers evaluation methods, LLMs as cognitive models, integration with cognitive architectures, and limitations.

🔗 arxiv.org/abs/2409.02387


6. Entropy, Thermodynamics and the Geometrization of the Language Model

Wenzhe Yang (2024)

Pure mathematics framework for language models using set theory, analysis, and statistical mechanics. Defines the moduli space of distributions, entropy function, partition function, internal energy, Helmholtz free energy, and the Boltzmann manifold. Argues zero-entropy points explain why LLMs need billions of parameters. Formulates an AGI conjecture.

🔗 arxiv.org/abs/2407.21092
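
For readers who want the quantities listed above written out, these are the textbook canonical-ensemble definitions (with Boltzmann's constant set to 1 and inverse temperature β). They are given as background only; the paper's own definitions may differ in notation and detail. Reading E_i as the negative logit of token i, so that p_i is a softmax, is an assumption made here for orientation.

```latex
\begin{align}
  Z(\beta) &= \sum_i e^{-\beta E_i}
    && \text{(partition function)} \\
  p_i &= \frac{e^{-\beta E_i}}{Z(\beta)}
    && \text{(Boltzmann distribution)} \\
  U &= \sum_i p_i E_i
    && \text{(internal energy)} \\
  S &= -\sum_i p_i \log p_i = \beta U + \log Z(\beta)
    && \text{(entropy)} \\
  F &= U - \frac{S}{\beta} = -\frac{1}{\beta} \log Z(\beta)
    && \text{(Helmholtz free energy)}
\end{align}
```

In this notation S = 0 exactly when the distribution puts all its mass on a single outcome, which is the sense of "zero-entropy point" assumed in this note.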


7. Transformer Dynamics: A Neuroscientific Approach to Interpretability

Jesseba Fernando & Grigori Guitchounts, Northeastern / Independent (2025)

Treats the residual stream of Llama 3.1 8B as a dynamical system evolving across layers. Finds accelerating velocity, decreasing mutual information, unstable periodic orbits (~10.74 rotations per unit), and attractor-like self-correcting dynamics in lower layers. Proposes a "neuroscience of AI" combining dynamical systems theory with mechanistic interpretability.

🔗 arxiv.org/abs/2502.12131
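
One simple way to read "velocity" of the residual stream is the size of the change in a token's activation vector from one layer to the next. The sketch below assumes that reading and runs on a synthetic stream whose update size grows with depth by construction; it does not load Llama 3.1 8B, and the helper names are hypothetical.

```python
import math
import random

def layer_velocities(stream):
    """L2 norm of the change between consecutive layers.
    stream: list of per-layer activation vectors for one token."""
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, nxt)))
        for prev, nxt in zip(stream, stream[1:])
    ]

def toy_residual_stream(n_layers=8, dim=16, seed=0):
    """Synthetic stand-in for real activations: each layer adds a
    random-direction update whose length grows linearly with depth."""
    rng = random.Random(seed)
    h = [0.0] * dim
    stream = [h]
    for layer in range(n_layers):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        step = 0.1 * (layer + 1)  # assumed growth schedule, not measured
        h = [a + step * d / norm for a, d in zip(h, direction)]
        stream.append(h)
    return stream

stream = toy_residual_stream()
velocities = layer_velocities(stream)
print([round(v, 3) for v in velocities])
```

With real activations, `stream` would be replaced by the per-layer hidden states for a token, and an increasing sequence of velocities would correspond to the acceleration the summary mentions.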


Key Isomorphisms Across Papers

| Connection | Papers | What converges |
|---|---|---|
| Boltzmann duality | Yang ↔ Anthropic | Entropy function and attribution graphs are formally dual operations on the same distribution |
| Orthogonal dynamical axes | Carson ↔ Fernando | Same multi-basin structure sliced across reasoning steps vs. across layers |
| Cortical prediction matches empirical finding | König ↔ Anthropic | Predicted laminar flow matches discovered input → abstract → output feature flow |
| Planning as metastability | Anthropic ↔ Carson | Multiple candidate features held simultaneously = metastable basin before commitment |
| Default circuits as max entropy | Anthropic ↔ Yang | "Can't answer" defaults = maximum entropy; known-entity features reduce entropy |

Thread by @sakeeb.rahman
