-
A Practical Review of Mechanistic Interpretability (2025)
- ArXiv Abstract (2407.02646) — Comprehensive taxonomy and guide for the entire field.
- Direct PDF
- Full HTML Version
-
Open Problems in Mechanistic Interpretability (2025)
- ArXiv Abstract (2501.16496) — Consensus from 29 researchers (DeepMind, Anthropic, FAR.ai, etc.) on major roadblocks.
- Direct PDF
- FAR.ai Deep Dive
-
Mechanistic Interpretability for AI Safety — A Review (2024)
- ArXiv Abstract (cross-referenced in surveys)
- Direct Blog/HTML — Roadmap for safety applications.
-
Mechanistic Indicators of Understanding in LLMs (2026)
- ArXiv Abstract (2507.08017) — Tiered framework for LLM "thinking" and understanding.
- Direct PDF
-
A Mathematical Framework for Transformer Circuits (2021)
- Interactive Paper — Classic: residual stream as communication channel.
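The paper's core picture (each attention head and MLP reads from a shared residual stream and additively writes back into it) can be sketched with toy shapes. The weights below are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Stand-ins for one attention write and one MLP write; random, illustrative only.
W_attn = rng.standard_normal((d_model, d_model)) * 0.1
W_mlp = rng.standard_normal((d_model, d_model)) * 0.1

x = rng.standard_normal(d_model)        # residual stream entering the block
x = x + W_attn @ x                      # attention reads the stream, adds its output
x = x + np.maximum(W_mlp @ x, 0.0)      # MLP reads the updated stream, adds its output

print(x.shape)  # (8,)
```

The additive updates are the point: later components see the sum of everything written so far, which is why the paper treats the stream as a communication channel between layers.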
-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Anthropic, 2024)
- Official Interactive Demo — Millions of features from production-scale model.
- Anthropic Research Blog
-
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (2023)
- Transformer Circuits Paper — Foundational SAE breakthrough.
-
Sparse Autoencoders Find Highly Interpretable Features in Language Models (2024)
-
Encourage or Inhibit Monosemanticity? (2024)
- ArXiv Abstract (2406.17969) — Debate on interpretability vs. performance.
- Direct PDF
-
Recent SAE Advances (2025): TopK SAEs, JumpReLU SAEs, Gated SAEs, Matryoshka SAEs, End-to-End SAEs (see Practical Review for taxonomy).
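As a rough illustration of the TopK variant listed above, here is a minimal forward pass with made-up dimensions (a sketch, not any particular library's implementation): keep the k largest latent pre-activations per example and zero the rest.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=4):
    """Toy TopK SAE forward pass: encode to a wide latent space, keep only
    the k largest entries per example, then reconstruct the input."""
    pre = x @ W_enc + b_enc
    # Indices of the k largest pre-activations for each row.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    acts = np.zeros_like(pre)
    np.put_along_axis(acts, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    acts = np.maximum(acts, 0.0)            # keep feature activations non-negative
    recon = acts @ W_dec + b_dec            # decode back to model space
    return acts, recon

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                     # latent space wider than model space
x = rng.standard_normal((2, d_model))
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
acts, recon = topk_sae_forward(x, W_enc, np.zeros(d_sae), W_dec, np.zeros(d_model))
print((acts != 0).sum(axis=-1))             # at most 4 active features per example
```

The TopK constraint replaces the L1 sparsity penalty of the original SAE recipe; the other variants above change the activation (JumpReLU, Gated) or the training objective (Matryoshka, End-to-End) instead.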
-
Interpreting Transformers via Attention Head Intervention (2026)
- ArXiv Abstract (2601.04398) — Specific heads driving behavior.
-
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small (2023)
- ArXiv Abstract (2211.00593)
- Direct PDF — Classic path-patching example.
-
In-Context Learning and Induction Heads (2022)
- Transformer Circuits Paper — Universality of induction circuits.
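The induction algorithm the paper identifies ([A][B] … [A] → predict [B]) can be stated in a few lines of plain Python. This mimics the circuit's behavior, not its weights:

```python
def induction_predict(tokens):
    """Toy induction-head behavior: find the most recent earlier occurrence
    of the current token and predict the token that followed it."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the prefix backwards
        if tokens[i] == cur:
            return tokens[i + 1]               # copy what followed last time
    return None                                # no earlier occurrence

print(induction_predict(["The", "cat", "sat", ".", "The"]))  # -> "cat"
```

In the actual circuit this takes two heads: a previous-token head writes "what came before me" into each position, and the induction head attends to the position after the earlier match and copies it.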
-
Modular Reasoning with Brain-Like Specialization (2025)
- ArXiv Abstract (2506.13331) — CoT and sparsity analysis.
- Direct PDF
-
Distilling Machine-Learned Algorithms into Code (2024)
-
Towards Automated Circuit Discovery (ACDC) (2023)
-
Progress Measures for Grokking via Mechanistic Interpretability (2023)
- ArXiv Abstract (2301.05217) — Circuit formation during training.
-
Mechanistic Interpretability for LLM Alignment (2026)
- ArXiv Abstract (2602.11180) — Circuit discovery to steering for safety.
-
Locating and Editing Factual Associations in GPT (2022)
- ArXiv Abstract (2202.05262) — Causal-tracing (activation patching) classic; introduces the ROME editing method.
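The activation-patching idea behind the paper's causal tracing can be sketched on a toy two-layer network (random weights, purely illustrative): cache a hidden activation from a clean run and splice it into a corrupted run.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8)) * 0.5
W2 = rng.standard_normal((8, 3)) * 0.5

def forward(x, patch_hidden=None):
    """Toy two-layer network; optionally overwrite the hidden activation
    before the second layer (the causal intervention)."""
    h = np.maximum(x @ W1, 0.0)
    if patch_hidden is not None:
        h = patch_hidden                  # splice in the cached clean activation
    return h @ W2, h

x_clean = np.array([1.0, 0.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0, 0.0])

y_clean, h_clean = forward(x_clean)       # clean run: cache the hidden state
y_corrupt, _ = forward(x_corrupt)         # corrupted run: different output
y_patched, _ = forward(x_corrupt, patch_hidden=h_clean)

# Patching the clean activation into the corrupted run restores the clean
# output, evidence that this site carries the causally relevant signal.
print(np.allclose(y_patched, y_clean))  # -> True
```

Here the whole hidden layer is patched, so restoration is total; in a real model one patches a single head or layer at a single token position and measures how much of the clean behavior returns.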
-
Causal Scrubbing: Rigorously Testing Interpretability Hypotheses (2022)
-
Mechanistic Interpretability of Code Correctness / Catastrophic Forgetting (2025–2026 papers) — See recent arXiv for continual fine-tuning circuits.
-
TransformerLens — Gold standard for mechanistic interpretability of GPT-style models.
-
nnsight — Powerful library for interpreting and intervening in model internals.
-
Sparse Autoencoder Library (AI Safety Foundation)
-
CircuitsVis — Beautiful attention & circuit visualizations.
-
Arrakis — Full suite for tracking MI experiments.
-
Transformer Debugger (OpenAI) — Interactive debugging with SAEs.
-
Awesome Mechanistic Interpretability LM Papers (Dakingrai)
- GitHub — Best taxonomy-based collection (follows Practical Review).
-
Awesome-LMMs-Mechanistic-Interpretability
- GitHub — Multimodal focus.
-
Awesome Mechanistic Interpretability (gauravfs-14)
- GitHub — Auto-updated with latest arXiv.
-
Neel Nanda’s Favourite Papers (Opinionated List)
-
Neel Nanda (Google DeepMind) — Glossary, reading lists, TransformerLens creator.
-
Anthropic Interpretability Team
-
FAR.AI — Open problems and safety-focused MI.
-
What Matters Right Now in Mechanistic Interpretability? (Neel Nanda, 2025)
-
A Walkthrough of A Mathematical Framework for Transformer Circuits (Neel Nanda)
-
Neel Nanda on Mechanistic Interpretability
-
Key Concepts
- Polysemanticity: One neuron firing for many unrelated concepts.
- Superposition: Representing more features than the activation space has dimensions, by packing features into non-orthogonal directions.
- Circuit: A subgraph of weights/heads performing a discrete algorithm.
- Logit Lens: Reading the model's intermediate "thoughts" by projecting each layer's residual stream through the unembedding to get token predictions.
- Activation Patching: Causal intervention by replacing activations to test hypotheses.
- Feature: A human-interpretable direction in activation space.
- Induction Head: Key attention-head motif behind in-context learning; completes [A][B] … [A] by copying [B].
- Universality: Hypothesis that similar circuits appear across models/tasks.
- Microscope AI: Using reverse-engineered internals of a powerful model without deploying it.
- Faithfulness: How well an interpretation explains actual model behavior.
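As a toy illustration of the Logit Lens entry above, project intermediate residual-stream states through an unembedding matrix to see which token each layer "currently" favors (all values random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5
W_U = rng.standard_normal((d_model, vocab))   # stand-in unembedding matrix

# Pretend these are the residual-stream states after each of 3 layers.
residuals = [rng.standard_normal(d_model) for _ in range(3)]

for layer, h in enumerate(residuals):
    logits = h @ W_U                          # read the stream in vocab space
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```

Watching how the top token changes across layers is the Logit Lens in miniature; in a real transformer one also applies the final LayerNorm before the unembedding.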