-
A Practical Review of Mechanistic Interpretability (2025)
- ArXiv Abstract (2407.02646) — Comprehensive taxonomy and guide for the entire field.
- Direct PDF
- Full HTML Version
-
Open Problems in Mechanistic Interpretability (2025)
- ArXiv Abstract (2501.16496) — Consensus from 29 researchers (DeepMind, Anthropic, FAR.ai, etc.) on major roadblocks.
- Direct PDF
- FAR.ai Deep Dive
-
Mechanistic Interpretability for AI Safety — A Review (2024)
- ArXiv Abstract (cross-referenced in surveys)
- Direct Blog/HTML — Roadmap for safety applications.
-
Mechanistic Indicators of Understanding in LLMs (2026)
- ArXiv Abstract (2507.08017) — Tiered framework for LLM "thinking" and understanding.
- Direct PDF
-
A Mathematical Framework for Transformer Circuits (2021)
- Interactive Paper — Classic: residual stream as communication channel.
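The paper's core picture (each attention head and MLP reads from a shared residual stream and additively writes back into it) can be sketched with toy shapes. The weights below are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Stand-ins for one attention write and one MLP write; random, illustrative only.
W_attn = rng.standard_normal((d_model, d_model)) * 0.1
W_mlp = rng.standard_normal((d_model, d_model)) * 0.1

x = rng.standard_normal(d_model)        # residual stream entering the block
x = x + W_attn @ x                      # attention reads the stream, adds its output
x = x + np.maximum(W_mlp @ x, 0.0)      # MLP reads the updated stream, adds its output

print(x.shape)  # (8,)
```

The additive updates are the point: later components see the sum of everything written so far, which is why the paper treats the stream as a communication channel between layers.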
-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Anthropic, 2024)
- Official Interactive Demo — Millions of features from production-scale model.
- Anthropic Research Blog
-
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (2023)
- Transformer Circuits Paper — Foundational SAE breakthrough.
-
Sparse Autoencoders Find Highly Interpretable Features in Language Models (2024)
-
Encourage or Inhibit Monosemanticity? (2024)
- ArXiv Abstract (2406.17969) — Debate on interpretability vs. performance.
- Direct PDF
-
Recent SAE Advances (2025): TopK SAEs, JumpReLU SAEs, Gated SAEs, Matryoshka SAEs, End-to-End SAEs (see Practical Review for taxonomy).
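As a rough illustration of the TopK variant listed above, here is a minimal forward pass with made-up dimensions (a sketch, not any particular library's implementation): keep the k largest latent pre-activations per example and zero the rest.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=4):
    """Toy TopK SAE forward pass: encode to a wide latent space, keep only
    the k largest entries per example, then reconstruct the input."""
    pre = x @ W_enc + b_enc
    # Indices of the k largest pre-activations for each row.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    acts = np.zeros_like(pre)
    np.put_along_axis(acts, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    acts = np.maximum(acts, 0.0)            # keep feature activations non-negative
    recon = acts @ W_dec + b_dec            # decode back to model space
    return acts, recon

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                     # latent space wider than model space
x = rng.standard_normal((2, d_model))
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
acts, recon = topk_sae_forward(x, W_enc, np.zeros(d_sae), W_dec, np.zeros(d_model))
print((acts != 0).sum(axis=-1))             # at most 4 active features per example
```

The TopK constraint replaces the L1 sparsity penalty of the original SAE recipe; the other variants above change the activation (JumpReLU, Gated) or the training objective (Matryoshka, End-to-End) instead.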
-
Interpreting Transformers via Attention Head Intervention (2026)
- ArXiv Abstract (2601.04398) — Specific heads driving behavior.
-
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small (2023)
- ArXiv Abstract (2211.00593)
- Direct PDF — Classic path-patching example.
-
In-Context Learning and Induction Heads (2022)
- Transformer Circuits Paper — Universality of induction circuits.
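The induction algorithm the paper identifies ([A][B] … [A] → predict [B]) can be stated in a few lines of plain Python. This mimics the circuit's behavior, not its weights:

```python
def induction_predict(tokens):
    """Toy induction-head behavior: find the most recent earlier occurrence
    of the current token and predict the token that followed it."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the prefix backwards
        if tokens[i] == cur:
            return tokens[i + 1]               # copy what followed last time
    return None                                # no earlier occurrence

print(induction_predict(["The", "cat", "sat", ".", "The"]))  # -> "cat"
```

In the actual circuit this takes two heads: a previous-token head writes "what came before me" into each position, and the induction head attends to the position after the earlier match and copies it.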
-
Modular Reasoning with Brain-Like Specialization (2025)
- ArXiv Abstract (2506.13331) — CoT and sparsity analysis.
- Direct PDF
-
Distilling Machine-Learned Algorithms into Code (2024)
-
Towards Automated Circuit Discovery (ACDC) (2023)
-
Progress Measures for Grokking via Mechanistic Interpretability (2023)
- ArXiv Abstract (2301.05217) — Circuit formation during training.
-
Mechanistic Interpretability for LLM Alignment (2026)
- ArXiv Abstract (2602.11180) — Circuit discovery to steering for safety.
-
Locating and Editing Factual Associations in GPT (2022)
- ArXiv Abstract (2202.05262) — Causal-tracing (activation patching) classic; introduces the ROME editing method.
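The activation-patching idea behind the paper's causal tracing can be sketched on a toy two-layer network (random weights, purely illustrative): cache a hidden activation from a clean run and splice it into a corrupted run.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8)) * 0.5
W2 = rng.standard_normal((8, 3)) * 0.5

def forward(x, patch_hidden=None):
    """Toy two-layer network; optionally overwrite the hidden activation
    before the second layer (the causal intervention)."""
    h = np.maximum(x @ W1, 0.0)
    if patch_hidden is not None:
        h = patch_hidden                  # splice in the cached clean activation
    return h @ W2, h

x_clean = np.array([1.0, 0.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0, 0.0])

y_clean, h_clean = forward(x_clean)       # clean run: cache the hidden state
y_corrupt, _ = forward(x_corrupt)         # corrupted run: different output
y_patched, _ = forward(x_corrupt, patch_hidden=h_clean)

# Patching the clean activation into the corrupted run restores the clean
# output, evidence that this site carries the causally relevant signal.
print(np.allclose(y_patched, y_clean))  # -> True
```

Here the whole hidden layer is patched, so restoration is total; in a real model one patches a single head or layer at a single token position and measures how much of the clean behavior returns.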
-
Causal Scrubbing: Rigorously Testing Interpretability Hypotheses (2022)
-
Mechanistic Interpretability of Code Correctness / Catastrophic Forgetting (2025–2026 papers) — See recent arXiv for continual fine-tuning circuits.
-
TransformerLens — Gold standard for mechanistic interpretability of GPT-style models.
-
nnsight — Powerful library for interpreting and intervening in model internals.
-
Sparse Autoencoder Library (AI Safety Foundation)
-
CircuitsVis — Beautiful attention & circuit visualizations.
-
Arrakis — Full suite for tracking MI experiments.
-
Transformer Debugger (OpenAI) — Interactive debugging with SAEs.
-
Awesome Mechanistic Interpretability LM Papers (Dakingrai)
- GitHub — Best taxonomy-based collection (follows Practical Review).
-
Awesome-LMMs-Mechanistic-Interpretability
- GitHub — Multimodal focus.
-
Awesome Mechanistic Interpretability (gauravfs-14)
- GitHub — Auto-updated with latest arXiv.
-
Neel Nanda’s Favourite Papers (Opinionated List)
-
Neel Nanda (Google DeepMind) — Glossary, reading lists, TransformerLens creator.
-
Anthropic Interpretability Team
-
FAR.AI — Open problems and safety-focused MI.
-
What Matters Right Now in Mechanistic Interpretability? (Neel Nanda, 2025)
-
A Walkthrough of A Mathematical Framework for Transformer Circuits (Neel Nanda)
-
Neel Nanda on Mechanistic Interpretability
-
Key Concepts
- Polysemanticity: One neuron firing for many unrelated concepts.
- Superposition: Representing more features than the activation space has dimensions, by packing features into non-orthogonal directions.
- Circuit: A subgraph of weights/heads performing a discrete algorithm.
- Logit Lens: Reading the model's intermediate "thoughts" by projecting each layer's residual stream through the unembedding to get token predictions.
- Activation Patching: Causal intervention by replacing activations to test hypotheses.
- Feature: A human-interpretable direction in activation space.
- Induction Head: Key attention-head motif behind in-context learning; completes [A][B] … [A] by copying [B].
- Universality: Hypothesis that similar circuits appear across models/tasks.
- Microscope AI: Using reverse-engineered internals of a powerful model without deploying it.
- Faithfulness: How well an interpretation explains actual model behavior.
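As a toy illustration of the Logit Lens entry above, project intermediate residual-stream states through an unembedding matrix to see which token each layer "currently" favors (all values random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5
W_U = rng.standard_normal((d_model, vocab))   # stand-in unembedding matrix

# Pretend these are the residual-stream states after each of 3 layers.
residuals = [rng.standard_normal(d_model) for _ in range(3)]

for layer, h in enumerate(residuals):
    logits = h @ W_U                          # read the stream in vocab space
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```

Watching how the top token changes across layers is the Logit Lens in miniature; in a real transformer one also applies the final LayerNorm before the unembedding.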