Mechanistic interpretability—the effort to reverse-engineer neural networks into human-understandable components—has reached a critical inflection point. A landmark collaborative paper published in January 2025 by 29 researchers across 18 organizations established the field's consensus open problems, while MIT Technology Review named the field a "breakthrough technology for 2026." Yet despite genuine progress on circuit discovery and feature identification, fundamental barriers persist: core concepts like "feature" still lack rigorous definitions, computational-complexity results prove that many interpretability queries are intractable, and practical methods still underperform simple baselines on safety-relevant tasks.
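To make the contested notion of a "feature" concrete, the sketch below shows the sparse-autoencoder approach referenced in the next paragraph: a learned dictionary that decomposes a model's internal activations into sparsely active directions, each treated as a candidate feature. This is a minimal illustrative sketch in PyTorch, not any lab's published implementation; the dimensions, the ReLU encoder, and the L1 coefficient are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder for decomposing model activations
    into sparsely active directions ("features"). Illustrative only."""

    def __init__(self, d_model: int, d_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction
        self.l1_coeff = l1_coeff                       # strength of the sparsity penalty (assumed value)

    def forward(self, activations: torch.Tensor):
        # Non-negative, sparse coefficients: most entries should be zero for any given input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        # Objective = reconstruction error + L1 penalty encouraging sparsity.
        recon_loss = (reconstruction - activations).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().sum(dim=-1).mean()
        return reconstruction, features, recon_loss + sparsity_loss

# Hypothetical usage: decompose a batch of residual-stream activations.
sae = SparseAutoencoder(d_model=768, d_features=16384)
acts = torch.randn(32, 768)          # stand-in for real activations from a trained model
_, features, loss = sae(acts)
print(features.shape, float(loss))   # each column of `features` is one candidate "feature"
```

The L1 penalty is what drives most coefficients to zero, and that sparsity is precisely the property that makes the learned directions candidates for human interpretation; it is also the operational definition of "feature" whose rigor the open-problems literature calls into question.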
The field is split between Anthropic's ambitious goal to "reliably detect most AI model problems by 2027" and Google DeepMind's strategic pivot away from sparse autoencoders toward "pragmatic interpretability." This tension—between