11 LLM Explainability

You ask a language model whether a patient should receive a particular drug. It says yes. You deploy this in a hospital. A patient dies. The family sues. The lawyer asks: why did the AI recommend this drug? You stare at 7 billion floating-point numbers and shrug.

This is the explainability problem, and it is not hypothetical. As LLMs are deployed in medicine, law, finance, and criminal justice, the ability to explain why a model produced a particular output is becoming a regulatory and ethical necessity. The EU AI Act, for example, imposes transparency obligations on high-risk AI systems and gives people affected by certain decisions made with such systems a right to a clear and meaningful explanation.

Explainability vs. Interpretability

These terms are often used interchangeably, but they mean different things. Explainability asks: can we provide a human-understandable justification for why the model produced this output? Interpretability (Chapter 10) asks: can we reverse-engineer the internal mechanisms by which the model computes its output? Explainability is about the “what and why” seen from the outside. Interpretability is about the “how” on the inside. A model can be explainable without being interpretable (a good chain-of-thought explanation) and interpretable without being explainable (we found the circuit, but good luck explaining it to a judge).

11.1 Attention Maps: The Obvious First Attempt

The transformer's self-attention mechanism (Vaswani et al. 2017) computes attention weights that indicate how much each token “attends to” every other token. The natural first instinct is to visualize these weights and declare: “Look, the model is paying attention to these words, so that's why it produced this output.”

In practice, you can extract attention weights from most Hugging Face transformer models by passing output_attentions=True to the forward call. The returned output then includes an attentions field: a tuple of tensors, one per layer, each of shape (batch, heads, sequence length, sequence length). Tools like BertViz create beautiful interactive visualizations that let you explore attention patterns head by head.
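To make that shape concrete without downloading a model, here is a minimal sketch of how one head's (sequence length, sequence length) slice comes about. The query and key vectors are made-up numbers, not taken from any real model, and causal masking is left out for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights for a single head (no masking)."""
    d = len(queries[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]

tokens = ["the", "cat", "sat"]
# Hypothetical 2-dimensional query and key vectors, one per token.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

weights = attention_weights(queries, keys)
for tok, row in zip(tokens, weights):
    # Row i shows how strongly token i attends to every token.
    print(tok, [round(w, 3) for w in row])
```

Each row is a probability distribution over the tokens being attended to, which is a large part of why attention maps look so much like importance scores.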

The Attention Fallacy

Attention weights are seductive but misleading. Research has shown that attention does not reliably explain model decisions. Sarthak Jain and Byron Wallace's paper “Attention is not Explanation” (2019) demonstrated that very different attention distributions can yield equivalent predictions. Sarah Wiegreffe and Yuval Pinter's response, “Attention is not not Explanation” (Wiegreffe and Pinter 2019), challenged parts of that analysis and argued that whether attention counts as explanation depends on what you mean by explanation. Either way, the practical cautions stand. Attention can be redundant (multiple heads attend to the same tokens), misleading (high attention does not imply causal importance), and context-dependent (the same head attends to different things depending on the input). If someone shows you an attention map and says “this is why the model made this decision,” be skeptical.

Despite these limitations, attention maps remain useful for exploration (they can reveal surprising patterns and generate hypotheses) and for debugging (noticing that a model attends only to the first three tokens of every input suggests a bug). Just do not treat them as ground-truth explanations.

11.2 Feature Attribution: Who Gets the Credit?

Feature attribution methods assign an importance score to each input token for a given output. The question is simple: which parts of the input were most responsible for the output?

11.2.1 Gradient-Based Methods

The most principled approach uses gradients. Since neural networks are differentiable, you can compute the gradient of the output with respect to each input token embedding and use the magnitude of this gradient (for example, its norm across the embedding dimensions) as a measure of importance.

Vanilla gradients: Simply compute \(\frac{\partial y}{\partial x_i}\) for each input token \(x_i\). Fast but noisy.
Gradient \(\times\) input: Multiply the gradient by the input embedding itself. This tends to produce cleaner attributions because it accounts for the magnitude of the input features, not just their sensitivity.
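Integrated gradients: Average the gradients along a straight path from a baseline (typically a zero embedding or a padding token) to the actual input, then multiply by the difference between the input and the baseline. This satisfies desirable axiomatic properties (sensitivity and implementation invariance) that simpler gradient methods lack, and the attributions sum to the difference between the model's output at the input and at the baseline.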
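The three methods above can be compared on a toy model. This is a sketch under simplifying assumptions: each “token” is a single scalar feature rather than an embedding vector, the model is a hypothetical logistic unit with made-up weights, and its gradient is written out by hand instead of coming from autograd. With a real transformer you would get the gradients from the framework and sum attributions over embedding dimensions.

```python
import math

# Hypothetical model: a logistic unit over three scalar "token" features.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.5

def model(x):
    z = sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    """Analytic gradient of the sigmoid output with respect to each input."""
    y = model(x)
    return [y * (1.0 - y) * w for w in WEIGHTS]

def grad_times_input(x):
    return [g * v for g, v in zip(gradient(x), x)]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann sum of gradients along the straight path baseline -> x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    # Average gradient along the path, scaled by (input - baseline).
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

vanilla = gradient(x)
gxi = grad_times_input(x)
ig = integrated_gradients(x, baseline)

print("vanilla gradient:    ", [round(v, 4) for v in vanilla])
print("gradient x input:    ", [round(v, 4) for v in gxi])
print("integrated gradients:", [round(v, 4) for v in ig])
# The integrated-gradients attributions should add up to the change in output.
print("sum of IG:", round(sum(ig), 4),
      "output change:", round(model(x) - model(baseline), 4))
```

The last line checks the completeness property: the integrated-gradients scores add up (to within numerical error) to the difference between the output at the input and at the baseline, which neither of the two simpler methods guarantees.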

11.2.2 Perturbation-Based Methods

An alternative to gradients: just remove or replace input tokens and see what happens.

LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro et al. 2016): Perturb the input by randomly masking tokens, observe the output changes, and fit a simple linear model to approximate the local decision boundary.
SHAP (SHapley Additive exPlanations) (Lundberg and Lee 2017): Based on Shapley values from cooperative game theory. Each token's importance is its marginal contribution, averaged with Shapley weights over all possible subsets of the other tokens. Theoretically elegant but computationally expensive (exponential in the number of tokens without approximations).

Shapley Values in 30 Seconds

Imagine you are splitting a restaurant bill among friends who each ordered different items and shared some dishes. Shapley values give the “fair” split: each person's contribution is computed by averaging over all possible orderings in which people could have arrived and ordered. In the AI context, each token's Shapley value is its fair share of the model's output, averaged over all possible combinations of other tokens being present or absent.
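For a handful of tokens, Shapley values can be computed exactly by brute force. The sketch below averages each token's marginal contribution over every ordering, and compares the result with the simplest perturbation method, leave-one-out occlusion. The function `toy_model` is a hypothetical stand-in for “model output when only these tokens are unmasked”; its scores are invented for illustration, and this is not how the SHAP library approximates the values in practice.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        present = set()
        for p in order:
            before = value(frozenset(present))
            present.add(p)
            totals[p] += value(frozenset(present)) - before
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_model(present):
    """Hypothetical score for predicting "Paris" given the unmasked tokens."""
    score = 0.0
    if "capital" in present:
        score += 1.0
    if "France" in present:
        score += 2.0
    if "capital" in present and "France" in present:
        score += 3.0  # interaction: the two tokens matter most together
    return score

tokens = ["The", "capital", "of", "France"]
phi = shapley_values(tokens, toy_model)

# Leave-one-out occlusion: drop one token from the full input and measure the change.
full = frozenset(tokens)
leave_one_out = {t: toy_model(full) - toy_model(full - {t}) for t in tokens}

print("Shapley:      ", phi)
print("leave-one-out:", leave_one_out)
print("total to explain:", toy_model(full) - toy_model(frozenset()))
```

In this toy setup the Shapley values add up to exactly the total score, splitting the interaction bonus evenly between “capital” and “France”, whereas leave-one-out credits the full bonus to both tokens and so over-counts. The loop visits all n! orderings, which is why practical implementations rely on sampling or other approximations.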

11.2.3 Attention Rollout

Attention weights at individual layers only show local information flow. Attention rollout propagates attention through the entire network by recursively multiplying attention matrices from the first layer to the last, incorporating residual connections. This gives a more global view of which input tokens ultimately influence the final output.
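A minimal sketch of the rollout recursion follows, assuming the attention at each layer has already been averaged over heads and that the residual connection is modeled by mixing each attention matrix equally with the identity. The two 3-token attention matrices are made-up values.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square-compatible nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def add_residual(attn):
    """Model the residual connection as 0.5 * attention + 0.5 * identity."""
    n = len(attn)
    return [
        [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]

def attention_rollout(layers):
    """Multiply residual-adjusted attention matrices from first layer to last."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        rollout = matmul(add_residual(attn), rollout)
    return rollout

# Hypothetical head-averaged causal attention for 3 tokens over 2 layers.
layer1 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
layer2 = [[1.0, 0.0, 0.0], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]

rollout = attention_rollout([layer1, layer2])
print("raw layer-2 attention, last token:", layer2[2])
print("rollout, last token:              ", [round(v, 3) for v in rollout[2]])
```

In layer 2 the last token puts zero attention on the first token, yet the rollout still assigns the first token a nonzero share, because information from it had already been mixed into the other positions in layer 1.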

11.3 The Logit Lens: Peeking Inside the Computation

The logit lens, introduced by nostalgebraist (a pseudonymous researcher), is one of the most elegant and accessible tools for understanding what happens inside a transformer.

The idea is beautifully simple: at each layer, take the intermediate hidden state and project it through the model's output (unembedding) matrix, usually after applying the model's final layer normalization. This tells you what token the model would predict if it stopped computing at that layer. By examining how the prediction changes from layer to layer, you can watch the model's “beliefs” evolve as information flows through the network.
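The projection step can be sketched with a toy example. Everything here is hypothetical: a three-word vocabulary, a 2-dimensional hidden state, and hand-picked per-layer hidden states for the final position. The final layer normalization that real implementations usually apply before unembedding is omitted.

```python
import math

# Hypothetical unembedding vectors: one 2-dimensional row per vocabulary token.
UNEMBED = {
    "the": [1.0, 0.0],
    "city": [0.7, 0.7],
    "Paris": [0.0, 1.0],
}

# Hypothetical hidden states at the last position after each of three layers.
HIDDEN_BY_LAYER = [
    [2.0, 0.1],
    [1.5, 1.5],
    [0.2, 3.0],
]

def logit_lens(hidden):
    """Project a hidden state onto the vocabulary and return softmax probabilities."""
    logits = {tok: sum(h * u for h, u in zip(hidden, vec)) for tok, vec in UNEMBED.items()}
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

predictions = []
for layer, hidden in enumerate(HIDDEN_BY_LAYER):
    probs = logit_lens(hidden)
    top = max(probs, key=probs.get)
    predictions.append(top)
    print(f"layer {layer}: top token = {top!r} (p = {probs[top]:.2f})")
```

With a real model, the hidden states would come from the residual stream at each layer (for example via activation caching in a library such as TransformerLens) and the unembedding matrix from the model's output head.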

What the Logit Lens Reveals

Apply the logit lens to a GPT model processing the sentence “The capital of France is” and you see something remarkable: in early layers, the model might predict random high-frequency words. By the middle layers, it starts predicting geography-related tokens. By the final layers, it confidently predicts “Paris.” You are watching the model retrieve a fact from its parameters, layer by layer. The tuned lens (Belrose et al. 2023) improves on this by training a small affine transformation at each layer, accounting for the fact that intermediate representations are not meant to be directly decoded.

11.4 Linear Probing: What Does the Model Know?

If you want to know whether a model has learned a particular concept, you can train a linear probe: a simple linear classifier on top of the model's hidden representations at a specific layer.

For example, you might want to know: does GPT-2 know the part of speech of each token? Train a linear probe to predict POS tags from the hidden states at each layer. If the probe achieves high accuracy at layer 6 but not at layer 2, you know the model has extracted POS information by layer 6.
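The probing recipe can be sketched end to end on synthetic data. The assumptions are loud ones: the “hidden states” below are Gaussian noise generated in the script, not activations from GPT-2; the task is binary (say, noun versus not-noun) rather than full POS tagging; and the “late layer” simply has the label written along one dimension while the “early layer” does not. The probe itself is an ordinary logistic regression trained by batch gradient descent.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(features, labels, epochs=100, lr=0.1):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, features, labels):
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(features, labels)
    )
    return hits / len(labels)

def fake_hidden_states(labels, signal, rng, dim=4):
    """Synthetic states: noise, plus the label along dimension 0 with strength `signal`."""
    states = []
    for y in labels:
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += signal * (1.0 if y == 1 else -1.0)
        states.append(h)
    return states

rng = random.Random(0)
train_labels = [rng.randint(0, 1) for _ in range(400)]
test_labels = [rng.randint(0, 1) for _ in range(200)]

results = {}
for name, signal in [("early layer", 0.0), ("late layer", 2.0)]:
    w, b = train_probe(fake_hidden_states(train_labels, signal, rng), train_labels)
    held_out = fake_hidden_states(test_labels, signal, rng)
    results[name] = accuracy(w, b, held_out, test_labels)
    print(f"{name}: held-out probe accuracy = {results[name]:.2f}")
```

The probe on the “early layer” should hover around chance, while the probe on the “late layer” should be far more accurate. With a real model you would replace `fake_hidden_states` with activations extracted at each layer and always evaluate on held-out tokens.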

Linear probes have been used to detect all sorts of encoded knowledge: named entities, syntactic structure, factual knowledge (“Berlin is the capital of Germany”), and even world models (like Othello board states, as shown by Kenneth Li et al.).

The Probing Paradox

A high-accuracy probe shows that information is linearly accessible in the representation, but it does not prove the model uses that information. A representation could encode POS tags without the downstream computation ever relying on them. This is the distinction between encoding and use, and it is one of the reasons explainability researchers are increasingly turning to causal methods (activation patching, ablation studies) rather than purely correlational ones.

11.5 Chain-of-Thought as Self-Explanation

Chain-of-thought prompting (Wei et al. 2022) asks the model to “think step by step” before producing a final answer. The resulting reasoning trace serves as a form of self-explanation: the model tells you why it reached its conclusion.

This is remarkably useful in practice. When a model shows its work, humans can verify the reasoning, catch errors, and build trust in the output. Deployed systems increasingly use chain-of-thought for exactly this reason.

But there is a fundamental caveat: the stated reasoning may not reflect the actual computation. Research from Anthropic and others has shown that models can produce convincing chain-of-thought explanations that are post-hoc rationalizations rather than faithful descriptions of internal processing. The model might arrive at the answer through an entirely different internal mechanism and then generate a plausible-sounding explanation.

Faithful vs. Plausible Explanations

A faithful explanation accurately describes the causal process that led to the output. A plausible explanation sounds convincing to a human but may not reflect what actually happened. LLMs are extraordinarily good at generating plausible explanations (producing plausible-sounding text is, in effect, what they are trained to do), which makes it dangerously easy to mistake plausibility for faithfulness. This is why mechanistic interpretability (Chapter 10) exists: to look at what actually happens inside the model, rather than trusting what the model says about itself.

11.6 Explainability in Practice: Tools and Libraries

Captum (PyTorch): Meta's comprehensive attribution library. Provides integrated gradients, SHAP, DeepLIFT, GradCAM, and other methods. Works with any PyTorch model, including transformers.
BertViz (Jesse Vig): Interactive attention visualization for any Hugging Face transformer. Supports head view (individual attention heads), model view (all heads at once), and neuron view (how individual neurons contribute to attention).
Ecco: An interactive library for exploring language model internals, including logit lens analysis, neuron activation analysis, and input saliency maps. Built by Jay Alammar, who also wrote the famous “The Illustrated Transformer” blog post.
TransformerLens: Neel Nanda's library for mechanistic interpretability research. While primarily an interpretability tool (discussed in depth in Chapter 10), it also provides logit lens, attention pattern analysis, and activation caching that are useful for explainability work.
LLM Transparency Tool (Meta FAIR): A web interface for interactively exploring how information flows through transformer layers, including contribution analysis and logit lens.

The Illustrated Transformer

If you have not read Jay Alammar's “The Illustrated Transformer” blog post, stop everything and go read it. It remains the single best visual explanation of the transformer architecture ever written. Alammar later wrote illustrated guides to GPT-2, BERT, and other models. His Ecco library extends this visual approach to interactive exploration.

11.7 The Limits of Explainability

Explainability is necessary but not sufficient. Even the best feature attribution map cannot tell you how the model combines those features to produce an output. Knowing that tokens “France” and “capital” were important for predicting “Paris” does not explain the computation that retrieved this fact from the model's parameters.

This is why explainability and interpretability are complementary, not competing, approaches. Explainability tells you what mattered. Interpretability (Chapter 10) tells you how it was processed. Together, they form a more complete understanding of model behavior.

The field is rapidly evolving. As models grow larger and more capable, and as regulatory requirements tighten, explainability is transitioning from a research curiosity to a practical necessity. The tools and techniques in this chapter are your starting point.

11.8 Exercises

Load GPT-2 from Hugging Face and use BertViz to visualize its attention patterns on a few sentences. Can you find heads that consistently attend to syntactic relationships (e.g., subject-verb, adjective-noun)? Can you find heads that attend to positional patterns regardless of content?
Apply the logit lens to GPT-2 or Pythia (using TransformerLens) on the prompt “The Eiffel Tower is located in the city of” (with no trailing period). At which layer does the model first start predicting “Paris”? How does the prediction distribution evolve through the layers?
Use Captum to compute integrated gradients for a BERT sentiment classifier on a movie review. Compare the token importance scores with the attention weights. Do they agree? Where do they disagree, and what might that tell you?
Write a prompt that elicits chain-of-thought reasoning from an LLM, then deliberately modify the chain-of-thought to contain an error while keeping the final answer correct. Does the model notice the inconsistency when you ask it to verify its own reasoning?
Take a factual question like “What is the capital of Australia?” and use SHAP to determine which tokens in the question most influence the answer. Then rephrase the question five different ways and see how the attributions change. Are the explanations stable across rephrasings?

References

Lundberg, Scott M., and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems 30.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” arXiv Preprint arXiv:1602.04938.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv Preprint arXiv:2201.11903.

Wiegreffe, Sarah, and Yuval Pinter. 2019. “Attention Is Not Not Explanation.” arXiv Preprint arXiv:1908.04626.