@hed0rah
Created March 27, 2026 06:31

Mechanistic Interpretability: Research Vault (Updated March 2026)

1. Core Frameworks & Theory (Surveys & Foundational Concepts)


2. Feature Extraction & Sparse Autoencoders (SAEs)


3. Circuit Discovery & Reasoning


4. Advanced Techniques & Applications (2025–2026)

  • Mechanistic Interpretability for LLM Alignment (2026)

  • Locating and Editing Factual Associations in GPT (2022)

  • Causal Scrubbing: Rigorously Testing Interpretability Hypotheses (2022)

  • Mechanistic Interpretability of Code Correctness / Catastrophic Forgetting (2025–2026 papers) — See recent arXiv for continual fine-tuning circuits.
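The causal techniques listed above (activation patching, causal scrubbing) share one core move: overwrite an internal activation and see whether behavior changes. A minimal sketch of that move on a toy two-layer scalar "model" (all weights and inputs are invented for illustration; real work uses hooks on a transformer, e.g. via TransformerLens):

```python
# Toy activation patching: cache an activation from a "clean" run and
# splice it into a "corrupted" run. If the output is restored, that
# activation is causally responsible for the behavior under study.
# The model and numbers here are invented purely for demonstration.

def relu(v):
    return max(0.0, v)

W1, W2 = 2.0, 3.0  # weights of a toy 2-layer scalar model

def forward(x, patched_hidden=None):
    """Run the model; optionally overwrite the hidden activation."""
    hidden = relu(W1 * x)
    if patched_hidden is not None:
        hidden = patched_hidden      # the causal intervention
    return W2 * hidden

clean_x, corrupt_x = 1.0, 0.0
clean_hidden = relu(W1 * clean_x)    # cache the activation on the clean run

clean_out = forward(clean_x)                               # 6.0
corrupt_out = forward(corrupt_x)                           # 0.0
patched_out = forward(corrupt_x, patched_hidden=clean_hidden)  # 6.0

print(clean_out, corrupt_out, patched_out)
```

Causal scrubbing generalizes this by patching according to a full hypothesis about which activations matter, then checking the hypothesis survives the intervention.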


5. Tools & Libraries (Essential for Experiments)

  • TransformerLens: Gold standard for mechanistic interpretability of GPT-style models.

  • nnsight: Powerful library for interpreting and intervening in model internals.

  • Sparse Autoencoder Library (AI Safety Foundation)

  • CircuitsVis: Beautiful attention & circuit visualizations.

  • Arrakis: Full suite for tracking MI experiments.

  • Transformer Debugger (OpenAI): Interactive debugging with SAEs.


6. Community Resources & Awesome Lists

  • Awesome Mechanistic Interpretability LM Papers (Dakingrai)

    • GitHub: Best taxonomy-based collection (follows Practical Review).
  • Awesome-LMMs-Mechanistic-Interpretability

    • GitHub: Multimodal focus.
  • Awesome Mechanistic Interpretability (gauravfs-14)

    • GitHub: Auto-updated with the latest arXiv papers.
  • Neel Nanda’s Favourite Papers (Opinionated List)


7. Key Researchers & Blogs


8. Videos, Talks & Explainers

  • What Matters Right Now in Mechanistic Interpretability? (Neel Nanda, 2025)

  • A Walkthrough of “A Mathematical Framework for Transformer Circuits” (Neel Nanda)

  • Neel Nanda on Mechanistic Interpretability


Technical Glossary

  • Polysemanticity: A single neuron firing for many unrelated concepts.
  • Superposition: Representing more features than the latent space has dimensions by overlapping them.
  • Circuit: A subgraph of weights/heads implementing a discrete algorithm.
  • Logit Lens: Decoding intermediate-layer activations with the unembedding to watch predictions form.
  • Activation Patching: Causal intervention that swaps in activations from another run to test hypotheses.
  • Feature: A human-interpretable direction in activation space.
  • Induction Head: Key attention motif behind in-context learning (copies previous token patterns).
  • Universality: The hypothesis that similar circuits recur across models and tasks.
  • Microscope AI: Extracting knowledge from a powerful model’s reverse-engineered internals without deploying it.
  • Faithfulness: The degree to which an interpretation reflects the model’s actual computation.
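The logit lens entry above can be made concrete with a toy example. Everything here (the vocab, the 2-d residual vectors, the unembedding matrix) is invented purely for illustration; real logit-lens work applies a transformer's actual unembedding matrix to its residual stream at each layer:

```python
# Toy "logit lens": decode intermediate residual-stream states with the
# final unembedding to watch the prediction form layer by layer.

VOCAB = ["cat", "dog", "car"]
UNEMBED = [          # rows: residual dimensions, columns: vocab tokens
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

def logits(resid):
    # Project a residual-stream vector onto each token's direction.
    return [sum(r * u for r, u in zip(resid, col)) for col in zip(*UNEMBED)]

# Made-up residual state after each layer: the model gradually rotates
# from the "cat" direction toward the "dog" direction.
resid_per_layer = [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]]

for layer, resid in enumerate(resid_per_layer):
    lg = logits(resid)
    print(f"layer {layer}: top token = {VOCAB[lg.index(max(lg))]}")
```

The intermediate decodings show the prediction flipping from "cat" to "dog" partway through the stack, which is exactly the kind of layer-by-layer story the logit lens is used to tell.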
