"Not All Language Model Features Are One-Dimensionally Linear" and "Sparse Feature Circuits" Summaries and Analyses


"Not All Language Model Features Are One-Dimensionally Linear"

High-Level Summary

This paper challenges the assumption that every meaningful internal feature in a language model can be represented as a single direction or neuron activation. The authors show that some model features are inherently multi-dimensional, meaning they require a combination of neurons or directions to express a concept. Using GPT-2 and the Mistral 7B model as case studies, they discover circular feature representations for concepts like the days of the week and months of the year (Not All Language Model Features Are Linear | OpenReview). In simple terms, the model encodes these concepts not along one axis but on a two-dimensional plane forming a loop (a circle). They demonstrate that the model actually uses these circular representations to perform reasoning tasks (like calculating what day comes a certain number of days after another day). This insight reveals that language models might internally represent some concepts in richer geometric forms than previously thought, which has implications for how we understand and interpret their reasoning.

Technical Deep Dive (Methods, Findings, Implications)

Methodology: The authors formalize what it means for a feature to be irreducibly multi-dimensional. A feature is considered multi-dimensional if it cannot be broken down into independent or non-overlapping one-dimensional features (Not All Language Model Features Are Linear | OpenReview). In other words, if no single neuron or linear combination can capture the feature’s behavior alone, it’s truly multi-dimensional. To automatically discover such features in a network’s activations, they use sparse autoencoders (SAEs) – neural networks trained to compress and reconstruct the model’s hidden activations with a sparsity constraint. This process yields a dictionary of candidate features (directions in activation space) that the model actually uses during processing (Not All Language Model Features Are Linear | OpenReview). By analyzing these learned feature vectors, the authors cluster and identify sets of neurons that work together, hinting at multi-dimensional structures.
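To make the SAE step concrete, below is a minimal PyTorch sketch of dictionary learning on cached hidden activations. It is illustrative only: the layer sizes, the L1 sparsity penalty, and all hyperparameters are assumptions rather than the paper’s actual training setup, and the real pipeline operates on activations cached from GPT-2 or Mistral 7B.

```python
# Minimal sparse-autoencoder sketch (illustrative; sizes and the L1 penalty
# are assumptions, not the paper's hyperparameters).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # features -> reconstructed activation

    def forward(self, x):
        f = torch.relu(self.encoder(x))             # sparse, non-negative feature activations
        x_hat = self.decoder(f)                     # x ≈ bias + sum_i f_i * d_i
        return x_hat, f

def train_sae(acts, d_dict=16384, l1_coef=1e-3, epochs=10, lr=1e-4):
    """acts: tensor of cached hidden activations, shape (n_samples, d_model)."""
    sae = SparseAutoencoder(acts.shape[-1], d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in torch.split(acts, 4096):
            x_hat, f = sae(batch)
            # Reconstruction error plus an L1 penalty that encourages sparse codes.
            loss = ((x_hat - batch) ** 2).mean() + l1_coef * f.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return sae  # the decoder's columns are the candidate feature directions
```

In this picture, the decoder’s columns play the role of the learned feature dictionary; the clustering and irreducibility analysis described above would then operate on those directions.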

Key Findings:

  • The SAE method revealed several interpretable 2D feature subspaces. Notably, they found that hidden activations corresponding to weekdays and months are arranged in a loop (topologically like a circle) when projected onto two dimensions. For example, the representation for "Monday" smoothly transitions through the week and wraps back around to "Sunday" and then "Monday" again, rather than lying on a straight line. Likewise, months January through December form a continuous ring in the activation space. These circular features are intuitive because days of the week and months of the year have a natural cyclical structure.
  • They identified tasks in which the language model uses these circular features. For instance, the model is prompted with questions like “Two days from Monday is ...” (a simple calendar arithmetic problem). The authors show that the model’s solution to this involves leveraging the day-of-week circle internally (Not All Language Model Features Are Linear | OpenReview). Essentially, the model "moves along" the circle of days to arrive at the correct answer (Wednesday in this case). Similarly, for months (e.g., "four months from January"), the model navigates the month circle.
  • Through intervention experiments, the paper provides causal evidence that these multi-dimensional features are not just incidental correlations but actually fundamental to the model’s computation on those tasks (Not All Language Model Features Are Linear | OpenReview). The authors intervened by editing or ablating the circular subspace in the model’s activations and observed the effects on the model’s output. Remarkably, tweaking only the two-dimensional circular feature (for example, rotating the day-of-week circle or zeroing it out; see the sketch after this list) often had almost the same impact on the model’s answer as intervening on the entire layer’s activations. This indicates that the information needed to do day arithmetic is largely contained in that small circular subspace. In fact, intervening on this subspace in early layers was enough to disrupt or change the answer, confirming that the model’s mechanism for these tasks heavily relies on the discovered feature.
  • They also delved into how these features are represented across the network’s layers. By breaking down hidden states into interpretable components, they examined whether the day-of-week circle remains coherent through the model’s layers or appears only at specific stages (Not All Language Model Features Are Linear | OpenReview). They found evidence that such features are present and continuous across multiple layers in Mistral 7B, suggesting the model maintains this structured representation as it processes the input. In other words, the concept of a “weekday cycle” doesn’t disappear after one layer; it persists (perhaps being refined) as the input moves deeper into the network.
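The projection-and-intervention experiments above can be illustrated with a short NumPy sketch. Everything here (the PCA-based choice of the 2D plane, the per-day rotation angle of 2π/7, the function names) is a simplified assumption meant to convey the idea, not the authors’ implementation.

```python
# Illustrative sketch (not the paper's code): look for a circular layout of
# weekday representations in a 2-D subspace, and intervene by rotating there.
import numpy as np

def circular_subspace(day_acts):
    """day_acts: array of shape (7, d_model), one hidden state per weekday."""
    centered = day_acts - day_acts.mean(axis=0)
    # Top-2 principal directions span the candidate circular subspace.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:2]                       # (2, d_model)
    coords = centered @ basis.T          # (7, 2) coordinates; plot to inspect the loop
    return basis, coords

def rotate_in_subspace(act, basis, k_days, period=7):
    """act: one activation (d_model,), centered with the same mean as above.
    Rotate it by k_days steps around the weekday circle, leaving the rest alone."""
    theta = 2 * np.pi * k_days / period
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    coords = basis @ act                 # (2,) coordinates in the plane
    delta = (rot @ coords) - coords      # change only the 2-D component
    return act + basis.T @ delta         # everything outside the plane is untouched
```

In this toy picture, answering “two days from Monday” corresponds to a rotation by 2 × (2π/7) within the plane, and ablating the subspace corresponds to zeroing out those two coordinates.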

Implications: These findings imply that language models can learn geometrically structured internal representations for certain concepts. The fact that a 7B-parameter transformer naturally learned a circular structure for modular concepts (like days of week) is striking. It suggests that some internal computations (especially those involving cyclical or modular logic) aren’t easily captured by the classic view of single neurons or linear features. Instead, the model may embed such logic in a small subspace that has a clear shape (a circle). This challenges the common practice in interpretability of looking for single "concept neurons" or directions – at least for some types of knowledge, we might need to look for higher-dimensional patterns. It also provides a concrete example of mechanistic interpretability: we have a hypothesis for how the model does day arithmetic (by moving along a learned circle), and experimental intervention confirms this hypothesis.

Relation to the Second Paper

Both this work and the second paper (“Sparse Feature Circuits”) are part of a growing trend in interpretability research that moves beyond single neurons to more expressive units of analysis. In this paper, the focus is on individual features that are multi-dimensional; in the second, the focus is on how features connect into circuits. They share a methodological foundation – notably the use of sparse autoencoders to discover human-interpretable features in the model’s latent space (Not All Language Model Features Are Linear | OpenReview). In fact, one can see the research progression: after identifying meaningful features (including multi-dimensional ones) in the model, a logical next step is to understand how these features interact and cause certain behaviors. Both papers involve some of the same authors and aim to explain model internals in human-understandable terms, but they zoom in on different levels of the puzzle. “Not All Language Model Features Are One-Dimensionally Linear” shows that some features themselves require multiple dimensions (like a coordinate system for days), whereas “Sparse Feature Circuits” (the next paper) uses features (including possibly such multi-dimensional ones) as nodes in a graph to explain complete model behaviors. In short, Paper 1 deals with the nature of individual features, while Paper 2 deals with the relationships among many features. They are complementary: the better we understand and isolate meaningful features (as done in Paper 1), the more effectively we can connect them into circuits and manipulate them (as done in Paper 2).

Potential Importance in AI Research

This paper’s contributions are important for several reasons:

  • Refining our mental model of representations: It provides concrete evidence that not everything in a neural network is neatly factorized into one-dimensional basis directions. AI researchers often conceptualize features as vectors or single directions (e.g., a "gender direction" or a "sentiment neuron"). Showing that some features are essentially planes or rings in the activation space means we may need richer descriptive frameworks. This could influence future interpretability research to look for shapes (planes, circles, toroids, etc.) in activation space rather than just directions.
  • Advancing mechanistic interpretability: By rigorously defining and identifying multi-dimensional features, the work pushes the field toward explaining how models encode and use information. It’s a step beyond surface-level correlation; the authors actually pinpoint part of the network’s mechanism for modular arithmetic. Such precise explanations are relatively rare for large-scale models, so this serves as a valuable case study of successful reverse-engineering of a model’s internals.
  • Guiding future feature discovery: The methods developed (like the sparse autoencoder technique combined with the irreducibility test) can be applied to other models and domains. Researchers aiming to find hidden structures (say in vision models or larger language models) might adopt similar techniques to uncover multi-dimensional features elsewhere. This could lead to discoveries of other complex geometric structures that neural networks learn (for example, loops for other cyclical data, grids for 2D relationships like chess boards, etc.).
  • Theoretical insight into superposition: There’s a known concept that neural networks superpose multiple features into the same neurons due to limited dimensionality (multiple concepts “share” neurons). This paper refines that idea by suggesting the superposition might sometimes be structured (e.g., two concepts forming a plane rather than being hopelessly entangled). Understanding that structure could help in designing networks or training regimes that avoid pathological superposition and instead encourage interpretable combinations.

Future Research Directions

The findings open up several avenues for further research:

  • Discovering other multi-dimensional features: Days and months are intuitive examples of inherently circular concepts. It would be interesting to search for other multi-dimensional features in language models. For instance, do models represent seasons, compass directions, or musical notes (which also have circular structure) in a similar way? What about more abstract concepts like emotions or story arcs – are they one-dimensional intensities or multi-dimensional spectra? Future work could apply the same methodology to find out.
  • Higher-dimensional structures: This paper focused on 2D features (circles). But irreducible features could be higher dimensional (3D or more). A future direction is to identify if models have “cones”, “spheres”, or other manifolds as internal features for certain complex concepts. For example, a 3D cyclical feature might be used for something like encoding time of day (which involves hour, minute, second – a torus-like structure). Developing techniques to visualize and validate those would be challenging but valuable.
  • Implications for model architecture: If some computations require multi-dimensional features, could architectures be modified to encourage or make use of this? For example, one might design modules that explicitly learn circular embeddings for known cyclical data (like an inductive bias). Research might explore if providing a model with a structured subspace (like a 2D plane for dates) improves learning or interpretability.
  • Dynamic tracking of features across layers: The paper hints at analyzing continuity of the feature through layers. A thorough investigation could track when and where in a model a certain multi-dimensional feature “comes together” and when it’s used. This could involve tools to trace a concept from input embedding to output. It might reveal, for example, that the model assembles the day-of-week circle in middle layers and then uses it, which informs us about layer roles.
  • Automating feature identification: While the authors automated much of the discovery process, some manual inspection was still needed to interpret the clusters as days, months, etc. Future work could aim to automatically label or recognize the meaning of discovered multi-dimensional features, perhaps by correlating feature activations with vocabulary items or known structured data (a minimal sketch of one such heuristic follows this list). This would be necessary for scaling up and cataloging thousands of features without a human in the loop for each one.
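As one concrete (and hypothetical) version of the auto-labeling idea in the last bullet, a feature could be summarized by the tokens on which it activates most strongly. The helper below is a sketch under that assumption, not a method from either paper.

```python
# Hypothetical auto-labeling heuristic: describe a discovered feature by the
# tokens and contexts where it fires hardest.
from collections import Counter

def top_activating_tokens(feature_acts, tokens, k=20):
    """feature_acts: one activation value per token position in a corpus sample;
    tokens: the corresponding token strings."""
    ranked = sorted(zip(feature_acts, tokens), key=lambda p: p[0], reverse=True)
    top = [tok for _, tok in ranked[:k]]
    # e.g. [('Monday', 5), ('Friday', 4), ...] would suggest a weekday feature
    return Counter(top).most_common()
```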

Applications and Broader Implications (Interpretability, Safety, etc.)

Understanding that language models learn multi-dimensional features has direct and indirect implications for AI practice:

  • Interpretability Tools: This work can improve interpretability dashboards or tools. For instance, instead of showing importance of single neurons for a prediction, tools might highlight a small set of neurons (a feature subspace) as a unit. Knowing about multi-dimensional features like the day-of-week circle, an interpretability interface could detect when the model is “spinning” that circle to answer a question. This provides a more faithful explanation of the model’s reasoning process to users or developers.
  • Model Debugging and Correction: If a model is making mistakes on tasks involving a certain concept, understanding the internal representation can help debug it. For example, if a model struggled with date calculations, we might inspect its day-of-week circle for irregularities. Perhaps the circle isn’t well formed or a certain transition is broken; knowing that gives a targeted angle to improve the model (maybe retrain on specific sequences, or even directly adjust the representation if possible).
  • Safety and Reliability: Interpretability is a key component of AI safety. By revealing that some computations are localized in small subspaces, this research suggests potential leverage points for interventions. If a dangerous or undesired behavior in a model is mediated by a particular multi-dimensional feature, we could in principle monitor or modify that feature to prevent the behavior. For example, if a model had an internal "user identity" feature spanning multiple neurons that caused it to leak private info, identifying and intervening on that could mitigate privacy risks.
  • Better Concept Understanding for Alignment: In AI alignment, we want models to follow human-intended reasoning. If certain reasoning processes (like arithmetic or logical deduction) correspond to clear geometric features, we could verify whether the model is using the correct process for a given task. The presence of a correct internal representation (like a consistent circular timeline for date reasoning) might be a sign that the model is “thinking” about the task in an intended way, rather than using some unintended shortcut.
  • Inspiration for Neuroscience and Cognitive Science: Interestingly, the idea of multi-dimensional concept representation mirrors theories in cognitive science (e.g., the mental “concept space” humans have for colors is 3D – brightness, hue, saturation, rather than a single number). Discovering such structures in ANN models might lead to cross-pollination between neuroscience and AI: neuroscientists might wonder if animal brains also encode certain periodic notions in loops or rings of neural activity (which has been hypothesized for things like head direction cells in rats, for example). Conversely, AI might borrow more ideas from how brains handle multi-dimensional representations of concepts.

"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models"

High-Level Summary

This paper introduces a framework to break down a language model’s behavior into interpretable pieces (features) and connections between them (circuits). Think of it as reverse-engineering a neural network: the goal is to find a small network of meaningful components that explains a particular behavior of the model. Traditional interpretability often looks at whole neurons or attention heads, but those can be hard to interpret because each neuron might mix many themes (they are polysemantic). Instead, the authors use features discovered by a sparse autoencoder (which tend to correspond to clearer concepts like a grammatical role, a topic, or a specific pattern in text) as the building blocks. Then they search for causal relationships among these features to form what they call sparse feature circuits – essentially a graph of a few features that together cause a certain model output or internal behavior (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | OpenReview). They demonstrate that these feature circuits can explain surprising behaviors of language models and can even be edited to change the model’s behavior in a controlled way. One highlight is a technique called SHIFT (Spurious Human-Interpretable Feature Trimming), where a human identifies irrelevant features in a circuit (like a spurious correlation the model latched onto) and removes them, thereby improving the model’s performance on that task (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | OpenReview). The paper also shows an automated pipeline that scales this analysis up: they discovered thousands of such circuits without manual supervision, hinting at a path toward systematically understanding large neural networks piece by piece (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | OpenReview).

Technical Deep Dive (Methods, Findings, Implications)

Methodology: The approach can be broken down into two major parts – (1) Finding interpretable features, and (2) Assembling those features into causal circuits. For the first part, the authors leverage dictionary learning via sparse autoencoders (SAEs), similar to the method used in the first paper. Essentially, they feed the language model’s activations into an autoencoder that tries to compress and reconstruct those activations. By imposing a sparsity constraint, the autoencoder learns a set of basis vectors (features) such that each model activation can be expressed as a sparse combination of these basis features. Prior work (e.g., by Bricken et al. 2023) showed that such features tend to align with human-interpretable concepts. Thus, instead of looking at raw neurons, they now have a library of cleaner semantic features (for example, one feature might activate for tokens that are inside quotes, another might correspond to a specific topic or grammatical structure, etc.).

For the second part, the authors need to find which of these features matter for a specific model behavior and how they connect. They define a model behavior in practice as something like a particular output the model produces or a class of outputs (it could be an unwanted behavior, a capability, or any phenomenon of interest). Then they use a combination of causal probing and attribution techniques to locate a small subnetwork of features responsible for that behavior. Concretely, they use linear approximations and influence measures (citing methods like integrated gradients and feature attribution) to estimate how much each feature contributes to the behavior. Features that have a high causal effect on the outcome are selected as nodes in the circuit. Next, to find connections between features, they likely examine how activating one feature influences the activation of another at later layers or how combinations of features jointly affect the outcome. This could involve techniques like path patching (testing the causal effect of intervening on one feature while observing another, as suggested by references to prior work (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models)). The end result is a sparse graph: each node is a feature (not a single neuron, but an interpretable direction in activation space), and edges indicate a causal relationship (one feature’s activation leads to or enables another feature’s activation or directly affects the output). Importantly, these circuits are sparse – only a handful of features – making them small enough for a human to understand.
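To illustrate the kind of linear attribution step described above, here is a hedged sketch of estimating each SAE feature’s approximate indirect effect on a behavior metric via a first-order, attribution-patching-style approximation. The function names, tensor shapes, and the metric_fn interface are assumptions for illustration, not the paper’s code.

```python
# Hedged sketch of a linear (attribution-patching-style) estimate of each
# feature's effect on a behavior metric; names and shapes are assumptions.
import torch

def approx_indirect_effects(f_clean, f_patch, metric_fn):
    """
    f_clean, f_patch: SAE feature activations (shape: (n_features,)) for a
    clean prompt and a counterfactual/patch prompt.
    metric_fn: differentiable function mapping feature activations to a
    scalar behavior metric (e.g. a logit difference), holding the rest of
    the model fixed.
    """
    f = f_clean.clone().requires_grad_(True)
    metric = metric_fn(f)
    metric.backward()
    # First-order estimate: effect_i ≈ (dmetric / df_i) * (f_patch_i - f_clean_i)
    effects = f.grad * (f_patch - f_clean)
    return effects  # high-magnitude entries are candidate circuit nodes
```

Features with large estimated effects would become candidate nodes; edges between features at different layers could, in principle, be estimated analogously.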

Key Findings and Examples:

  • The authors show that circuits built from these features can successfully explain model behaviors that were previously “mysterious.” For example, one discovered circuit dealt with a pattern similar to the classic induction head behavior (where the model completes sequences like A, A, B, ... with B). In their feature-level description, they found a “narrow induction feature” that recognizes a repeating token pattern like "A3 ... A3" (where a token appears, then some content, then the same token again) (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models). This feature effectively captures the model noticing a repetition. The circuit connects this to a “succession feature” which appears to implement the logic “if something was repeated, predict it will increment next.” Together, these features form a circuit that produces the outcome "... A4" after seeing "A3 ... A3", meaning the model is copying and incrementing the sequence (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models). This is a fine-grained, interpretable explanation of a task that a whole attention head was originally believed to handle opaquely – now we see it as a logical composition of two understandable parts.
  • Another example circuit they highlight involves grammar: one feature activates after verbs that can be followed by an infinitive (like "wanted" or "decided") and thereby promotes the token "to" in the next position; a second feature detects when a verb or preposition requires an object and also promotes "to" as the next word (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models). In combination, these two features explain how the model decides to insert the word "to" in a sentence. This kind of insight is more satisfying than saying “Neuron 4567 fires for 'to'” – instead we have a mini-rule: Feature1 (context says an infinitive might follow) AND Feature2 (object needed) => suggest 'to'.
  • The paper introduces SHIFT (Spurious Human-Interpretable Feature Trimming) as a practical application of these circuits. In a case study, they took a language-model-based classifier that predicts a person’s profession from their biography (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models). Without guidance, the classifier picked up a spurious correlation: it used the person’s gender as a strong hint for the job (because in the training data, for example, most nurses were women and most engineers were men – a stereotypical bias in the dataset). Using their method, the authors discovered a feature circuit in the classifier that included a “gender feature” spuriously influencing the profession prediction. With SHIFT, a human reviewer looks at the circuit, identifies that the gender feature is not actually relevant to the real task, and then ablates (removes) that feature’s influence (a minimal sketch of this ablation step follows this list). They then retrain or fine-tune the classifier slightly without that feature. The result was that the classifier no longer relied on gender at all, and its accuracy on a balanced evaluation (where gender and profession are not correlated) jumped to match an oracle model that had been trained on unbiased data (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models). In other words, SHIFT surgically removed the dependency on a bias without needing a special balanced dataset ahead of time. This is a powerful demonstration: interpretability wasn’t just for explaining the model – it was used to fix the model. The paper reports that in this biased scenario, the model after SHIFT achieved the same performance as if it had been trained on a perfectly balanced dataset (which the authors call the “Oracle” baseline) (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models).
  • Finally, the authors built an unsupervised pipeline to scale up the discovery of feature circuits. They automatically generated a large set of behaviors to analyze by clustering the model’s outputs and activation patterns (using a method from an earlier work by Michaud et al., 2023). For each cluster of behaviors (each representing an “interesting model behavior” that emerged from data, without a human initially specifying it), they applied their feature-circuit discovery process. Impressively, they uncovered thousands of circuits this way, without manual intervention for each (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | OpenReview). They even provide a browser (feature-circuits.xyz) where one can explore these circuits. This is a significant step toward a more comprehensive interpretability: rather than studying one behavior at a time, we could aim to map out a large portion of the model’s internal logic automatically. The example circuits mentioned above (the induction-like pattern, the grammar rule) were among those discovered in this automated fashion (Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models). It shows that the approach is general enough to find both low-level circuits (like adding "to") and higher-level ones (like copying sequences), covering a wide range of model phenomena.
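To make the SHIFT step in the third bullet more concrete, the sketch below shows one plausible way to ablate human-judged spurious SAE features from a hidden state before it reaches the classifier head (reusing the toy SparseAutoencoder interface sketched earlier). The feature indices, the choice to preserve the SAE reconstruction error, and all names are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative SHIFT-style ablation (a sketch under assumed interfaces, not
# the paper's code): zero out human-judged task-irrelevant SAE features.
import torch

def ablate_features(hidden, sae, irrelevant_ids):
    """hidden: (batch, d_model) activations fed to the classifier head;
    sae: a trained sparse autoencoder with .encoder / .decoder as sketched earlier;
    irrelevant_ids: indices of features a human judged task-irrelevant
    (e.g. a hypothetical 'gender' feature)."""
    f = torch.relu(sae.encoder(hidden))     # sparse feature coefficients
    recon = sae.decoder(f)
    error = hidden - recon                  # SAE reconstruction error, kept as-is
    f_trimmed = f.clone()
    f_trimmed[:, irrelevant_ids] = 0.0      # trim the spurious features
    return sae.decoder(f_trimmed) + error   # edited hidden state for the classifier
```

The edited hidden states would then feed the profession classifier, optionally followed by a brief fine-tune of the classifier head, roughly mirroring the SHIFT procedure described above.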

Implications:

  • The concept of “feature circuits” marries two important ideas in interpretability: disentangled features and causal circuits. By working at the level of features that are already human-meaningful, the resulting circuits are far more interpretable than circuits made of raw neurons or attention heads. This means we can potentially trust and verify these explanations more easily. For example, when they present a circuit, each node can be described in plain language (like "gender signal" or "repeat-sequence detector") instead of just an ID number of a neuron. This interpretability is crucial if such analyses are to be used in real-world auditing of AI systems.
  • The success of SHIFT demonstrates a new paradigm: using interpretability to improve models. Often interpretability is separate from model training and deployment (we analyze a trained model to understand it). Here, they closed the loop by taking an interpretation (the circuit with an undesired feature) and then modifying the model accordingly. This could inspire new techniques in model training where interpretability acts as a regularizer or a repair mechanism. For instance, one could imagine training a model and periodically “pruning” any spurious feature circuits that appear, thus guiding the model to use more robust strategies.
  • The fact that an unsupervised pipeline can find thousands of these circuits is both exciting and a bit daunting. It suggests that inside a large language model, there is a vast web of feature interactions implementing myriad functions. Having a way to automatically map these out could eventually lead to a fuller understanding of a model’s internals, almost like mapping a genome. However, it also raises the question of how to organize and make sense of so many circuits – perhaps an area for future research (e.g., ranking them by importance, grouping them into modules, etc.).

Relation to the First Paper

As mentioned earlier, this work is complementary to “Not All Language Model Features Are One-Dimensionally Linear.” Both studies use sparse autoencoders to find meaningful features in a language model (Not All Language Model Features Are Linear | OpenReview), highlighting a convergence in interpretability techniques. The first paper focuses on the existence of complex single features (multi-dimensional ones like the day-of-week circle), whereas this second paper uses a web of features to explain behaviors. One way to see the relationship: if the first paper is saying “some concepts live in 2D (or multi-D) spaces, not just lines,” the second paper could readily incorporate that insight by allowing features in its circuits that are multi-dimensional (since the features come from the same kind of SAE dictionary). In fact, the authors of the second paper likely benefited from the improved feature discovery techniques discussed in the first. They both aim to break the model’s computation down into understandable parts – Paper 1 zooms in on what a part can look like (and finds it can be a plane, not just a line), and Paper 2 zooms out to see how parts connect into an algorithm. They share the ultimate goal of opening the “black box” of neural networks in a rigorous way. Practically, insights from Paper 1 about multi-dimensional features could feed into the pipeline of Paper 2: for example, recognizing that a circular feature (like an internal date representation) might be a node in a circuit that handles date calculations or temporal reasoning in the model. Both works exemplify the trend of feature-level interpretability (as opposed to neuron-level), marking a shift in how researchers dissect neural nets.

Potential Importance in AI Research

This paper’s approach and findings are significant on multiple fronts:

  • Scalability of Interpretability: One of the biggest challenges in interpretability is how to scale analyses to models with billions of parameters and countless behaviors. By automating the discovery of circuits and showing it can handle thousands of cases, this work is a step toward scalable interpretability. It hints that we might not have to manually discover every circuit of interest; instead, tools could systematically surface the most relevant circuits for an analyst to review. This moves us closer to the ambitious goal of fully auditing a large model’s decision-making process.
  • Causality and Understanding: The emphasis on causal graphs (not just correlational patterns) is crucial. Many prior interpretability findings are correlational (e.g., “when neuron X is active, output Y often happens”). By identifying circuits that are causally implicated – meaning if you intervene and break that circuit, the behavior changes – the paper ensures these interpretations reflect actual mechanisms. This rigorous approach strengthens the scientific foundation of interpretability research, making the explanations more trustworthy.
  • Integration of Techniques: The paper effectively combines techniques from different areas: unsupervised feature learning, graph/causal analysis, and human-in-the-loop editing. This integration is important because real-world interpretability solutions will likely need a mix of automated and human-guided steps. The method provides a blueprint for how to weave these together. As such, it could inspire other researchers to build on this pipeline, perhaps improving each component (better feature learning algorithms, more efficient circuit search, etc.).
  • Benchmark for Mechanistic Explanations: By demonstrating concrete circuits for known phenomena (like the induction example or grammar features), it provides case studies that others can reference and build upon. It’s a form of verification that these methods work: we now have several vetted examples where we understand “why the model did X” in terms of a small network of features. This is valuable for the interpretability community – each such example is a proof that complex model behaviors can be understood, which encourages tackling the next, perhaps more complicated, behavior.
  • Cross-Disciplinary Impact: The idea of simplifying a complex model into an interpretable causal graph may also influence fields like neuroscience (where one might want to find circuits of neurons explaining brain functions) or cognitive science (explaining human decision processes in terms of latent feature circuits). Conversely, ideas from those fields about modular organization and causal pathways could be applied back to analyzing AI models, prompted by works like this.

Future Research Directions

The paper opens many exciting questions and follow-ups:

  • Generalizing to Other Models and Modalities: While demonstrated on language models, the concept of sparse feature circuits could be applied to other neural networks. Future research might try to discover feature circuits in vision models (e.g., circuits of features that cause a CNN to recognize a texture vs. an object) or in multi-modal models (where a circuit might connect a visual feature and a text feature to drive a caption output). Adapting the technique to these domains might require new types of interpretable features (for images, features might correspond to patches or patterns rather than words).
  • Interactive Model Debugging Tools: One could build interactive tools for model developers using the ideas here. For example, a future direction is a “circuit inspector”: given a model and some behavior, the tool automatically shows the hypothesized feature circuit. The developer could then toggle features on/off (similar to SHIFT but in a user interface) and see how outputs change. Developing such a tool would require refining the algorithms for real-time use and ensuring the circuits found are reliably correct. It also raises research questions about the best ways to visualize and explain these circuits to users who are not experts in interpretability.
  • Enhancing Training with Circuits: Another direction is incorporating circuit knowledge into training or fine-tuning. For instance, one could imagine a training procedure that penalizes the model if it develops certain unwanted circuits or if the circuits for a task are too complex (to encourage simpler, more interpretable strategies). Conversely, one might encourage certain circuits that correspond to desirable reasoning. This blends interpretability with techniques in model editing and regularization. Designing such objectives and testing if they lead to more robust or alignable models would be a valuable exploration.
  • Library of Found Circuits: As thousands of circuits are discovered, an important research task is to organize and categorize them. Future work could focus on clustering circuits, identifying higher-level common patterns (maybe many circuits implement some form of copying mechanism, or many circuits in language models correspond to grammar rules, etc.). This could lead to a taxonomy of circuits in language models. Such a taxonomy would deepen our theoretical understanding: e.g., it might reveal that large language models, no matter how trained, always develop a certain set of fundamental circuits (for tasks like copying, translation, factual recall, etc.). Researchers could also compare circuits across model sizes or architectures to see how inductive biases affect the learned circuits.
  • Causal Validation and Refinement: As with any interpretability method, there’s a need to continually validate that the circuits identified are truly the ones used by the model. Future work might refine the causal discovery aspect – for example, using more sophisticated tools from causal inference to rule out false positives in the circuit graph. There’s also room to explore nonlinear interactions (the current approach leans on linear approximations). Perhaps some circuits involve features that combine nonlinearly to affect the output. Capturing those would require extending the methods beyond linearity, maybe using conditional interventions or Boolean circuit approximations of the model’s logic.

Possible Applications and Broader Implications

This research has strong implications for interpretability, AI safety, and the design of AI systems:

  • Model Auditing and Safety: Regulators or ethics committees concerned with how an AI makes decisions could use sparse feature circuits to audit models. For example, before deploying a language model in a medical advice setting, one might analyze circuits related to how it handles questions about medication. If a circuit reveals that a certain irrelevant feature (say, presence of a rare word) unduly influences medical advice, that’s a red flag to address. The ability to pinpoint a mechanism and edit it (as with SHIFT) is incredibly valuable for safety – it’s not just identifying the problem, but also providing a means to fix the model without a complete retrain. This surgical editability could prevent specific failure modes (like the model relying on racial cues for crime prediction, or any other undesirable heuristic).
  • Bias and Fairness Mitigation: The SHIFT method directly showcases an application in removing bias. This could be generalized to many other biases: for instance, if a translation model has a gender bias in pronoun predictions, one could find the gender feature circuit and adjust it. It’s a proactive way to enforce fairness constraints by intervening on the model’s internal representation. Compared to traditional debiasing (which might require lots of balanced data or adversarial training), this feature-level intervention might be more data-efficient and targeted.
  • Improving Robustness and Generalization: Beyond ethical biases, models often latch onto spurious correlations that hurt performance on out-of-distribution data. By discovering circuits, we might detect shortcut features the model uses (like text length as a proxy for something, or presence of specific punctuation as a signal). Removing or weakening those could make the model rely on more fundamental reasoning, thereby improving its robustness when faced with unusual inputs. Essentially, feature circuits give us a window into “what the model is paying attention to.” If some of those things are brittle, we can adjust the model to pay attention to more reliable features.
  • Transparency and Trust: For AI systems deployed in high-stakes areas, being able to explain decisions is crucial for user trust. Sparse feature circuits can form the backbone of explanations that are both honest and understandable. Instead of a vague feature importance list, an explanation could be, for example: “The model formed its answer based on features X, Y, Z interacting. Feature X corresponds to 'symptom mentions', Y to 'family history', and Z to 'disease likelihood'. The presence of symptoms and family history together triggered the prediction of disease.” Such an explanation, derived from a circuit, is more narrative and satisfying than raw statistics. It could be given to doctors using an AI diagnostic tool, for instance, to double-check the AI’s reasoning.
  • Research on Network Design: In the long term, understanding circuits might influence how we design neural network architectures. If we learn that certain circuits are always useful, we might build modules explicitly for them. Conversely, if certain problematic circuits keep emerging, we might adjust architecture or training to make them less likely. The concept of modular design in neural networks could be informed by the circuits we discover: perhaps future models will have more explicit separations that mirror the circuits (like dedicated sub-networks for certain functions) to make them more interpretable by design.
  • Alignment with Human Values: On the path to aligning AI with human values, one proposal is to deeply understand what goals and reasoning patterns a model has. Feature circuits provide a granular view of how a model is thinking. If an AI were to develop an undesirable subgoal or deceptive strategy, ideally it would show up as a sub-circuit within its larger decision-making graph. Being able to spot that and intervene (either by editing or by training feedback) could prevent misaligned behavior. This is speculative and very challenging, but this paper’s methodology is a step toward being able to inspect and modify the “cogs” in the giant machinery of a neural network, which is highly relevant for alignment work.

Conclusion

Together, both papers represent a significant advance in mechanistic interpretability of AI models. The first reveals that the building blocks of model knowledge can be richer than we assumed (not just single directions, but small subspaces with structure), and the second shows how to wire together these building blocks to form understandable explanations and even improve the model. They reinforce each other and paint a picture of a future where we can systematically dissect neural networks: identifying their concepts and features, mapping out the circuits of interaction, and tweaking those circuits to ensure the model behaves in desirable ways. This line of research is likely to play a crucial role in developing AI systems that we can trust, align, and deeply understand.
