14 LLM Interpretability

In May 2024, Anthropic published a paper that drew wide attention in the AI research community (Templeton et al. 2024). They had trained sparse autoencoders on Claude 3 Sonnet and extracted millions of interpretable features: individual directions in the model's activation space that corresponded to specific concepts. One feature responded to the Golden Gate Bridge. Another activated for code bugs. Another for deception. And when they artificially amplified the Golden Gate Bridge feature, Claude became fixated on it, steering conversations toward the bridge regardless of the topic. They had found a knob inside the model and turned it.

This is interpretability: the science of opening up neural networks and understanding how they actually compute their outputs, not from the outside (that is explainability, Chapter 8) but from the inside, at the level of individual neurons, attention heads, and circuits. It is arguably the most important subfield of AI safety research, because you cannot align what you do not understand.

The Transformer Circuits Thread

Anthropic's research team (Chris Olah, Nelson Elhage, Neel Nanda, and collaborators) has published an extraordinary series of papers on the Transformer Circuits blog (transformer-circuits.pub). This thread, which started with “A Mathematical Framework for Transformer Circuits” (Elhage et al. 2021) and continues through “Towards Monosemanticity” (Bricken et al. 2023) and “Scaling Monosemanticity” (Templeton et al. 2024), is the single most important body of work in mechanistic interpretability. If you want to understand how transformers work internally, not just architecturally but computationally, start with this thread. It is technical but beautifully written, with interactive visualizations and clear explanations.

14.1 What Is Mechanistic Interpretability?

Mechanistic interpretability (“mech interp” in the community) treats neural networks as programs to be reverse-engineered. Rather than treating the model as a black box and probing it from the outside, mech interp looks at the weights and activations directly to understand the algorithms the model has learned.

The analogy is to reverse-engineering compiled software: given a binary, can you reconstruct the source code? Given a trained neural network, can you reconstruct the algorithms it implements?

Chris Olah's Vision

The goal of mechanistic interpretability is to build “neuroscience for artificial neural networks.” Unlike biological neuroscience, we enjoy perfect access to every neuron, every weight, and every activation. We can run the same input thousands of times with identical results. We can ablate any component and observe the effect. We have every advantage that biological neuroscientists dream of, and yet understanding these networks remains extraordinarily difficult.

14.2 Circuits: The Building Blocks of Computation

A circuit is a subgraph of the model's computation graph that is responsible for a specific behavior. Think of it as a subroutine: a small, identifiable piece of the network that implements a particular function.

The Transformer Circuits thread introduced the framework of analyzing transformers as compositions of attention heads and MLP layers, where each component contributes to the residual stream and downstream components read from it.
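To make the "every component writes into the residual stream" picture concrete, here is a minimal NumPy sketch. The components are hypothetical stand-ins (random `tanh` maps), not real attention heads or MLPs, and the final layer norm is ignored. What it illustrates is that, because the stream is a running sum, the output logits split into one additive term per component.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5

# Hypothetical stand-ins for attention heads / MLP layers: each reads the
# current residual stream and returns a vector to add back into it.
def make_component(rng, d_model):
    w = rng.normal(scale=0.3, size=(d_model, d_model))
    return lambda stream: np.tanh(stream @ w)

components = {f"component_{i}": make_component(rng, d_model) for i in range(4)}
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix (assumed, no layer norm)

embedding = rng.normal(size=d_model)
stream = embedding.copy()
writes = {"embedding": embedding}
for name, fn in components.items():
    out = fn(stream)       # the component reads the stream...
    writes[name] = out
    stream = stream + out  # ...and writes its output back by addition

logits = stream @ W_U
# Because the stream is a sum, the logits decompose into per-component terms.
direct_effect = {name: out @ W_U for name, out in writes.items()}
total = sum(direct_effect.values())
```

Here `total` matches `logits` up to floating-point error. Each entry of `direct_effect` captures only what a component writes straight to the output, not the influence it has by changing what later components read.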

14.2.1 Induction Heads: The First Major Discovery

Induction heads are perhaps the most important circuit discovered in transformers. They implement a simple but powerful pattern: given the sequence [A][B]...[A], they predict that [B] will follow the second [A]. This is in-context copying: the model recognizes a pattern it has seen earlier in the context and reproduces it.

Elhage et al. (2021) showed that induction heads arise from the composition of two heads:

A “previous token” head in an early layer that copies information about each token into the position that follows it, so each token's representation includes information about the previous token.
An “induction” head in a later layer that attends from the current [A] to the token that came right after an earlier [A]. It can find that token because, thanks to the previous-token head, [B]'s representation records that [A] preceded it. The head then copies [B] forward as the prediction.

This two-step composition is significant because it demonstrates that transformers learn algorithms, not just statistical correlations. The induction circuit is a general-purpose copying mechanism that works for any tokens [A] and [B], regardless of what they are.
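The algorithm the induction circuit implements can be written in a few lines of plain Python. This is a sketch of the behavior only, not of the attention mechanics; the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current token. Returns None if no match."""
    current = tokens[-1]
    # Step 1 (previous-token head): pair every position with its predecessor.
    prev_token_at = {i: tokens[i - 1] for i in range(1, len(tokens))}
    # Step 2 (induction head): scan backwards for a position whose
    # predecessor equals the current token, then copy the token found there.
    for i in range(len(tokens) - 1, 0, -1):
        if prev_token_at[i] == current:
            return tokens[i]
    return None

prediction = induction_predict(["A", "B", "C", "A"])  # copies "B"
```

Nothing in the function depends on what the tokens are, which is the sense in which the circuit is a general-purpose copying mechanism.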

The Phase Change

Olsson et al. (2022), in the follow-up paper “In-context Learning and Induction Heads,” observed that induction heads emerge suddenly during training, in a phenomenon they call a “phase change.” Before the phase change, the model relies mostly on simpler statistics such as bigrams (predicting the next token from the current token alone). After the phase change, in-context learning ability improves abruptly: the model can copy patterns from earlier in the context. This suggests that induction heads are not just one circuit among many, but a fundamental mechanism underlying the transformer's ability to do in-context learning.

14.2.2 Other Known Circuits

Researchers have identified circuits for various behaviors:

Indirect object identification: Wang et al. (2023) found a circuit in GPT-2 Small that correctly resolves sentences like “When Mary and John went to the store, John gave a drink to” \(\to\) “Mary.” The circuit is built from several classes of attention heads, including heads that detect the duplicated name and “name mover” heads that copy the remaining name to the output.
Greater-than: Hanna et al. found a circuit in GPT-2 Small that handles greater-than comparisons, for example completing “The war lasted from the year 1732 to the year 17” with a two-digit year larger than 32.
Factual recall: Researchers have studied circuits that retrieve factual information (“The capital of France is [Paris]”) from the model's parameters. The emerging picture is that MLP layers act as stores of factual associations, while attention heads move the relevant information to the position where the prediction is made.

14.2.3 Finding Circuits: Activation Patching

The primary tool for identifying circuits is activation patching (also called “causal tracing” or “interchange intervention”). The idea:

Run the model on a “clean” input (where it produces the correct output).
Run the model on a “corrupted” input (where the answer changes).
One component at a time, replace (“patch”) the corrupted run's activation with the clean run's activation.
If patching a component restores the correct output, that component is causally important.
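The four steps above can be sketched on a toy model. Everything here is an assumption made for illustration: the "model" is three parallel random components feeding a shared stream, the wiring is rigged so that only `head_a` can see input dimension 0, and the names `run` and `restored` are hypothetical. In practice the same loop would use hooks from a library such as TransformerLens or pyvene.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Toy "model": three parallel components write into a shared stream,
# and a fixed readout turns the stream into a scalar score.
weights = {name: rng.normal(size=(d, d)) for name in ("head_a", "head_b", "mlp")}
# Hypothetical wiring: only head_a can see input dimension 0.
weights["head_b"][0, :] = 0.0
weights["mlp"][0, :] = 0.0
readout = rng.normal(size=d)

def run(x, patch=None):
    """Run the toy model; `patch` maps component name -> activation to force."""
    patch = patch or {}
    cache, stream = {}, np.zeros(d)
    for name, w in weights.items():
        act = patch.get(name, np.tanh(x @ w))
        cache[name] = act
        stream = stream + act
    return stream @ readout, cache

clean = rng.normal(size=d)
corrupted = clean.copy()
corrupted[0] = -clean[0]          # corrupt only the information head_a uses

clean_score, clean_cache = run(clean)          # step 1
corrupted_score, _ = run(corrupted)            # step 2

# Steps 3-4: patch one component at a time and measure how much of the
# clean-vs-corrupted gap is recovered (1.0 = fully restored, 0.0 = no effect).
restored = {}
for name in weights:
    patched_score, _ = run(corrupted, patch={name: clean_cache[name]})
    restored[name] = (patched_score - corrupted_score) / (clean_score - corrupted_score)
```

In this rigged example `restored["head_a"]` comes out at 1.0 and the other two at 0.0. On a real model the fractions are spread across many components, and the choice of metric (logit difference, probability, loss) affects what the numbers mean.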

Automated circuit discovery (Conmy et al. 2023) scales this process: instead of manually testing each component, algorithms systematically search for the minimal subgraph that is sufficient to reproduce the model's behavior on a given task.

14.3 Superposition: The Fundamental Challenge

Here is the central puzzle of interpretability: models represent far more features than they have neurons. A model with \(d\) neurons might encode thousands or millions of distinct concepts. How?

The answer is superposition (Elhage et al. 2022). Features are represented as directions in activation space, and these directions can be almost-orthogonal (but not perfectly orthogonal) to each other. With \(d\) dimensions, you can pack exponentially many almost-orthogonal directions, just as you can pack many more “nearly perpendicular” lines in a high-dimensional space than perfectly perpendicular ones.
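A quick numerical check makes the "almost-orthogonal" claim tangible. The sketch below draws random unit vectors (an assumption: real feature directions are learned, not random) and measures how much they overlap when there are about four times more directions than dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1000                      # 1000 directions in a 256-dim space

directions = rng.normal(size=(n, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit vectors

cosines = directions @ directions.T
off_diag = np.abs(cosines[~np.eye(n, dtype=bool)])  # drop self-similarities

max_overlap = off_diag.max()          # worst-case |cosine| between two directions
mean_overlap = off_diag.mean()        # typical |cosine|
rank = np.linalg.matrix_rank(directions)
```

The rank is still only `d`, so the directions cannot all be exactly orthogonal, yet the typical overlap is small (on the order of \(1/\sqrt{d}\)). That residual overlap is the interference the model has to tolerate.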

The Polysemantic Neuron Problem

Superposition makes individual neurons polysemantic: a single neuron might activate for “academic citations,” “the year 2003,” and “DNA sequences.” This is not because these concepts are related, but because the model has packed multiple unrelated features into directions that all overlap with the same neuron. This means you cannot understand the model by looking at individual neurons: you need to look at directions in the full activation space. This insight transformed the field.

Why does superposition happen? Toy models studied by Elhage et al. suggest that networks face a tradeoff: they can represent \(d\) features perfectly (one per neuron) or many more features imperfectly (using superposition). If most features are sparse (they only activate on a small fraction of inputs), two superposed features are rarely active at the same time, so the interference between them is small and the model benefits from representing more features. In practice, most natural language features are sparse (“Golden Gate Bridge” appears in a tiny fraction of all texts), so superposition is usually the better trade.
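The role of sparsity can be checked with a small untrained experiment. This is not the trained toy model from Elhage et al.; it is a simplified sketch that assumes random unit feature directions and a naive linear read-out, and the helper `mean_interference` is a made-up name. It compares the read-out error when features are dense against the error when they are sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                         # 512 features squeezed into 64 dims
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit direction per feature

def mean_interference(p_active, trials=200):
    """Average squared read-out error when each feature is on with prob p_active."""
    errors = []
    for _ in range(trials):
        active = rng.random(n) < p_active
        x = active * rng.random(n)      # sparse, non-negative feature values
        h = x @ W                       # compress into d dimensions
        x_hat = W @ h                   # naive linear read-out, no training
        errors.append(np.mean((x_hat - x) ** 2))
    return float(np.mean(errors))

dense_error = mean_interference(p_active=0.5)
sparse_error = mean_interference(p_active=0.01)
```

With the same directions, the sparse setting gives a much smaller error than the dense one, because fewer features are active at once to interfere with each other.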

14.4 Sparse Autoencoders: Decomposing Superposition

If features are superposed, how do you extract them? Anthropic's answer: train a sparse autoencoder (SAE) on the model's activations.

An SAE is a simple neural network with one hidden layer that is much wider than the input. It takes a model's activation vector (say, dimension 4096) and maps it through a hidden layer of dimension 65536 or higher, then reconstructs the original activation. The key constraint: the hidden representation must be sparse (most hidden units are zero for any given input).

Each unit in the SAE's hidden layer corresponds to a candidate feature, and that unit's decoder weights give the direction in activation space associated with it. Because the hidden layer is much wider than the input, the SAE can represent many more features than there are neurons, decomposing the superposed representation into its constituent parts.
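A minimal NumPy sketch of the SAE forward pass and training objective follows. The sizes are toy values, the parameters are random and untrained, and the L1 coefficient is an arbitrary assumption; published SAEs differ in details such as bias handling and decoder normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 128              # toy sizes; the text's example is 4096 -> 65536

# Hypothetical, untrained parameters. A real SAE learns these by gradient descent.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU encoder
    reconstruction = features @ W_dec + b_dec       # linear decoder
    return features, reconstruction

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    features, reconstruction = sae_forward(x)
    mse = np.mean(np.sum((reconstruction - x) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * sparsity

activations = rng.normal(size=(32, d_model))   # stand-in for model activations
features, reconstruction = sae_forward(activations)
loss = sae_loss(activations)
```

Row `k` of `W_dec` is the direction that feature `k` writes back into activation space. With random weights the features are not sparse; sparsity only appears once the loss is minimized over many real activations.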

Anthropic's Scaling Monosemanticity

In their 2024 paper “Scaling Monosemanticity” (Templeton et al. 2024), Anthropic trained SAEs with up to 34 million features on Claude 3 Sonnet and found features for an astonishing range of concepts: specific people (Elon Musk, Taylor Swift), places (the Golden Gate Bridge, the Eiffel Tower), abstract concepts (deception, sycophancy, code quality), programming languages, mathematical notation, and much more. Many features were multilingual: the same feature activated for “deception” in English, French, Chinese, and other languages, suggesting the model has language-independent internal representations. And features were steerable: clamping a feature to a high value during inference caused the model to produce outputs related to that concept.
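Feature clamping can be sketched as follows. This assumes a bias-free toy SAE with random weights, and the function name `steer` is hypothetical. The idea is to encode the activation, force one feature to a chosen value, decode, and add back the part of the activation the SAE could not reconstruct so that everything else is left untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))   # untrained, illustrative
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def encode(x):
    return np.maximum(0.0, x @ W_enc)

def decode(features):
    return features @ W_dec

def steer(x, feature_idx, clamp_value):
    """Return x with one SAE feature forced to clamp_value."""
    features = encode(x)
    error = x - decode(features)            # part of x the SAE fails to explain
    features[..., feature_idx] = clamp_value
    return decode(features) + error         # keep the unexplained part unchanged

x = rng.normal(size=d_model)
steered = steer(x, feature_idx=7, clamp_value=10.0)
shift = steered - x
```

The net effect is to move the activation along feature 7's decoder direction by the gap between the clamp value and the feature's original activation. In a real model the steered vector would replace the activation at the hooked layer on every forward pass.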

14.5 Representation Engineering

An alternative to decomposing individual features: study the geometry of the entire representation space. Representation engineering identifies directions in activation space that correspond to high-level properties like “truthfulness,” “harmfulness,” or “refusal,” and manipulates them directly.

The approach typically works by:

Collecting pairs of prompts that differ in one conceptual dimension (e.g., truthful vs. untruthful statements).
Computing the mean activation difference between the two groups at each layer.
Using this difference vector to steer the model at inference time by adding or subtracting it from the activations.
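The difference-of-means recipe above can be illustrated on synthetic data. The activations here are random stand-ins for one layer of a model, with a planted direction separating the two groups; in practice they would be cached from paired prompts, and the names `steering_vector` and `steer` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs = 32, 200

# Synthetic stand-in for one layer's activations: both groups share the same
# noise distribution, and the "positive" group is shifted along a hidden direction.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
negative_acts = rng.normal(size=(n_pairs, d_model))
positive_acts = rng.normal(size=(n_pairs, d_model)) + 3.0 * true_direction

# Step 2: mean activation difference between the two groups.
steering_vector = positive_acts.mean(axis=0) - negative_acts.mean(axis=0)

def steer(activation, alpha):
    """Step 3: add (alpha > 0) or subtract (alpha < 0) the direction."""
    return activation + alpha * steering_vector

# How well does the estimate line up with the planted direction?
cosine = steering_vector @ true_direction / np.linalg.norm(steering_vector)
```

Averaging cancels most of the noise, so the estimated vector points almost exactly along the planted direction. The scale `alpha` is a tuning knob: too small has no visible effect, too large tends to degrade the model's output.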

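The intervention step is sometimes called “activation engineering” or “activation steering,” while reading a property off the activations without intervening is called “representation reading.” These techniques have been used to make models more truthful, less toxic, and less likely to refuse legitimate requests, all without any retraining.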

14.6 Automated Interpretability

Can you use language models to interpret language models? Bills et al. (2023) tried exactly this: they used GPT-4 to generate natural language descriptions of what individual neurons in GPT-2 respond to, then scored those descriptions by how well they predicted neuron activations on new inputs.

This “LLMs interpreting LLMs” approach is appealing because it scales: you cannot afford to have human researchers manually inspect millions of neurons, but you can afford to have GPT-4 do it. The descriptions are often surprisingly insightful (“this neuron activates for months of the year when they appear in date formats”), though the approach has limitations for polysemantic neurons and specialized patterns that are hard to describe in natural language.
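The scoring step can be sketched in pure Python. The simulator below is a hypothetical keyword matcher standing in for the LLM that predicts activations from a description, and the activation values are made up for illustration. The score is the correlation between simulated and real activations.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

MONTHS = {"january", "march", "june", "october"}

def simulate(explanation_keywords, tokens):
    """Hypothetical simulator: predicts 1.0 on tokens the explanation names.
    In the real pipeline an LLM makes this prediction from the description."""
    return [1.0 if t.lower() in explanation_keywords else 0.0 for t in tokens]

tokens = ["Due", "on", "March", "3", "or", "June", "9", "at", "noon"]
real_activations = [0.0, 0.1, 0.9, 0.2, 0.0, 0.8, 0.1, 0.0, 0.0]  # made-up values

# A description that matches the neuron scores high; a wrong one scores low.
good_score = pearson(simulate(MONTHS, tokens), real_activations)
bad_score = pearson(simulate({"on", "at", "or"}, tokens), real_activations)
```

A polysemantic neuron would score poorly under any single short description, which is one way the limitation mentioned above shows up in the numbers.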

Neuronpedia

Neuronpedia (neuronpedia.org) is an interactive platform where you can browse the features discovered by sparse autoencoders across various models. Search for a concept (“Golden Gate Bridge,” “Python code,” “sarcasm”) and see which features respond to it, along with the top-activating examples from the training data. It is the closest thing we have to a “dictionary” for neural network features.

14.7 Tools for Interpretability Research

TransformerLens (Neel Nanda): The standard library for mechanistic interpretability. Loads transformer models with full hook access, enabling activation patching, ablation studies, and circuit analysis. Supports GPT-2, Pythia, and many other models.
SAELens: Tools for training and analyzing sparse autoencoders on language model activations. Used for feature extraction and analysis.
CircuitsVis: Interactive visualizations for attention patterns, activation patching results, and circuit diagrams. Created by the TransformerLens team.
pyvene: A library for performing interchange interventions (activation patching) on arbitrary PyTorch models, developed by Stanford's NLP group.

Getting Started with Mech Interp

If you want to get into mechanistic interpretability, here is the recommended path: (1) Read Neel Nanda's “Comprehensive Mechanistic Interpretability Explainer” blog post. (2) Work through the TransformerLens tutorials, starting with “Main Demo” and the “Exploratory Analysis” tutorial. (3) Read “A Mathematical Framework for Transformer Circuits” from the Transformer Circuits thread. (4) Read “Towards Monosemanticity” and “Scaling Monosemanticity” from Anthropic. (5) Pick a small model (GPT-2 Small or Pythia 70M), pick a behavior you want to understand, and try to find the circuit responsible. The ARENA (Alignment Research Engineer Accelerator) curriculum also has excellent exercises. The field is young and accessible: meaningful contributions do not require a PhD or massive compute resources.

14.8 Open Problems

Mechanistic interpretability is still in its early stages. Major open questions include:

Scaling to frontier models: Most interpretability work has been done on small models (GPT-2 Small with 124M parameters, or about 85M excluding embeddings; Pythia models up to 6.9B). Anthropic's SAE work on Claude 3 Sonnet is one of the first major results on a frontier model, but much more work is needed.
Completeness: When you find a circuit, how do you know you have found everything? Models may have backup circuits, redundant pathways, and fallback mechanisms that only activate when the primary circuit fails.
Faithfulness of SAE features: Do the features discovered by sparse autoencoders truly reflect the model's internal ontology, or are they artifacts of the SAE architecture? The model did not learn these features directly; we are imposing a particular decomposition.
Safety applications: The ultimate goal is to use interpretability to detect dangerous behaviors (deception, power-seeking, misalignment) before deployment. We are not there yet, but the Anthropic team has shown that features related to deception and sycophancy can be identified in SAE analyses.
Feature universality: Do different models learn the same features? Early evidence points toward yes: similar features appear across model families trained on different data, suggesting a kind of convergent evolution in neural network representations. How far this holds is still an open question.

14.9 Exercises

Install TransformerLens and load GPT-2 Small. Pick a head in layer 5 or 6 and visualize its attention patterns on ten different inputs. Can you characterize what this head does? Is it syntactic, positional, or semantic?
Replicate the induction head experiment: construct inputs of the form [A][B]...[A] and measure which attention heads attend from the second [A] back to the first [B]. Use TransformerLens's run_with_cache to extract attention patterns.
Perform activation patching on GPT-2 for a factual recall prompt like “The capital of Germany is.” Patch each attention head and MLP layer one at a time. Which components are causally necessary for producing “Berlin”?
Browse Neuronpedia and find five features from a GPT-2 SAE analysis. For each feature, examine the top-activating examples. Do the features seem monosemantic (single concept) or polysemantic (multiple concepts)? What fraction of features are interpretable?
Read the “Toy Models of Superposition” paper from Anthropic's Transformer Circuits thread. Reproduce the key experiment: train a simple autoencoder on synthetic data and observe how the model transitions from dedicated neurons to superposed representations as the features become sparser and as the number of features increases relative to the number of neurons.

References

Bills, Steven, Nick Cammarata, Dan Mossing, et al. 2023. “Language Models Can Explain Neurons in Language Models.” OpenAI Blog.

Bricken, Trenton, Adly Templeton, Joshua Batson, et al. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Anthropic.

Conmy, Arthur, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. “Towards Automated Circuit Discovery for Mechanistic Interpretability.” arXiv Preprint arXiv:2304.14997.

Elhage, Nelson, Tristan Hume, Catherine Olsson, et al. 2022. “Toy Models of Superposition.” arXiv Preprint arXiv:2209.10652.

Elhage, Nelson, Neel Nanda, Catherine Olsson, et al. 2021. “A Mathematical Framework for Transformer Circuits.” Transformer Circuits Thread.

Templeton, Adly, Tom Conerly, Jonathan Marcus, et al. 2024. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Anthropic.

Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.” arXiv Preprint arXiv:2211.00593.