[Circuit Tracing Examples 0] Methodology
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
transformer-circuits.pub
We examine the internal mechanisms of an LLM in a variety of contexts.
The basic setup is as follows.
Replacement Model
The replacement model approximately reproduces the activations of the original model using more interpretable components. It is based on a cross-layer transcoder (CLT) architecture, which is trained to replace the model’s MLP neurons with features: sparsely active “replacement neurons” that often represent interpretable concepts.
In this paper, we use a CLT with a total of 30 million features across all layers.
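To make the CLT idea concrete, here is a minimal sketch of the architecture, not the paper's implementation: features at layer l read from the residual stream at layer l and write into the reconstructed MLP outputs of every layer from l onward. All class names, dimensions, and the ReLU nonlinearity are illustrative assumptions; sparsity would come from the training objective, which is omitted here.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Minimal sketch of a cross-layer transcoder (CLT).

    Features at layer l are computed from the residual stream at layer l
    and contribute to the reconstructed MLP outputs of layers l..L-1.
    Dimensions and names are illustrative, not the paper's.
    """

    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.n_layers = n_layers
        # One encoder per layer: residual stream -> feature activations.
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, n_features) for _ in range(n_layers)
        )
        # Decoders: features at layer l write to every layer lp >= l.
        self.decoders = nn.ModuleDict({
            f"{l}_{lp}": nn.Linear(n_features, d_model, bias=False)
            for l in range(n_layers) for lp in range(l, n_layers)
        })

    def forward(self, resid):
        # resid[l]: residual-stream input to the MLP at layer l, shape (..., d_model).
        acts = [torch.relu(self.encoders[l](resid[l])) for l in range(self.n_layers)]
        recon = []
        for lp in range(self.n_layers):
            # Reconstructed MLP output at layer lp sums contributions
            # from features at all earlier-or-equal layers l <= lp.
            recon.append(sum(self.decoders[f"{l}_{lp}"](acts[l]) for l in range(lp + 1)))
        return recon, acts
```

The CLT is then trained so that `recon[l]` matches the original model's MLP output at layer l while keeping `acts` sparse.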
Local Replacement Model
Replacement models don’t perfectly reconstruct the activations of the original model. On any given prompt, there are gaps between the two. We can fill in these gaps by including error nodes which represent the discrepancy between the two models. Unlike features, we can’t interpret error nodes. But including them gives us a more precise sense of how incomplete our explanations are. Our replacement model also doesn’t attempt to replace the attention layers of the original model. On any given prompt, we simply use the attention patterns of the original model and treat them as fixed components.
The resulting model – incorporating error nodes and inheriting the attention patterns from the original model – we call the local replacement model. It is “local” to a given prompt because error nodes and attention patterns vary between different prompts. But it still represents as much of the original model’s computation as possible using (somewhat) interpretable features.
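A small sketch of the error-node bookkeeping, under the assumption that we already have the original MLP outputs and the CLT reconstructions for one prompt; the function and variable names are stand-ins:

```python
import torch

def local_replacement_outputs(original_mlp_outputs, clt_reconstructions):
    """Sketch of error nodes for a local replacement model (one prompt).

    The error node at each layer is the residual between the original
    model's MLP output and the CLT reconstruction. Adding it back makes the
    replacement model's per-layer contribution match the original exactly
    on this prompt. Attention patterns are inherited from the original
    model and treated as fixed, so they are not modeled here at all.
    """
    error_nodes = [
        orig - recon
        for orig, recon in zip(original_mlp_outputs, clt_reconstructions)
    ]
    # Reconstruction + error is what the local replacement model propagates.
    corrected = [
        recon + err
        for recon, err in zip(clt_reconstructions, error_nodes)
    ]
    return corrected, error_nodes
```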
Attribution Graphs
By studying the interactions between features in the local replacement model, we can trace its intermediate steps as it produces responses. More concretely, we produce attribution graphs, a graphical representation of the computational steps the model uses to determine its output for a particular input, in which nodes represent features and edges represent the causal interactions between them. As attribution graphs can be quite complex, we prune them to their most important components by removing nodes and edges that do not contribute significantly to the model’s output.
With a pruned attribution graph in hand, we often observe groups of features with related meanings that play a similar role in the graph. By manually grouping these related graph nodes together into supernodes, we can obtain a simplified depiction of the computational steps performed by the model.
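As one way to picture the pruning step, here is a hedged sketch using `networkx`: nodes are active features (plus error and output nodes), edge weights approximate direct causal effects, and we keep only the nodes carrying most of the influence onto the output. The scoring rule below (personalized PageRank on the reversed graph) is a simple stand-in heuristic, not the paper's exact pruning procedure.

```python
import networkx as nx

def prune_attribution_graph(G: nx.DiGraph, output_node, keep_fraction: float = 0.8):
    """Keep the nodes that account for most of the influence on the output.

    G: directed graph whose nodes are features/error/output nodes and whose
    edge attribute "weight" approximates the direct effect of source on target.
    """
    # Score each node's influence on the output by walking backwards from it,
    # weighting by |edge weight|.
    R = G.reverse(copy=True)
    for _, _, d in R.edges(data=True):
        d["weight"] = abs(d.get("weight", 0.0))
    influence = nx.pagerank(R, personalization={output_node: 1.0}, weight="weight")

    # Keep the highest-influence nodes until `keep_fraction` of total influence is covered.
    ranked = sorted(influence, key=influence.get, reverse=True)
    total, kept, acc = sum(influence.values()), set(), 0.0
    for node in ranked:
        kept.add(node)
        acc += influence[node]
        if acc >= keep_fraction * total:
            break
    kept.add(output_node)
    return G.subgraph(kept).copy()
```

Supernodes would then be formed on top of the pruned graph by manually merging nodes whose features share a meaning and role.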

Limitations & How to Validate
Because they are based on our replacement model, we cannot use attribution graphs to draw conclusions with certainty about the underlying model (i.e. Claude 3.5 Haiku). Thus, the attribution graphs provide hypotheses about mechanisms operating in the underlying model. To gain confidence that the mechanisms we describe are real and significant, we can perform intervention experiments in the original model, such as inhibiting feature groups and observing their effects on other features and on the model’s output. If the effects are consistent with what our attribution graph predicts, we gain confidence that the graph is capturing real (though potentially incomplete) mechanisms within the model.
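A sketch of what such an intervention could look like in code, assuming a hypothetical `model` with per-layer hook points and a `clt` object exposing `feature_activation` and `decoder_direction` helpers (both assumed, not a real API): we steer the residual stream against a feature group's write directions and compare the output distribution before and after.

```python
import torch

@torch.no_grad()
def inhibit_features_and_compare(model, clt, prompt_tokens, feature_ids, scale=-1.0):
    """Sketch of an intervention experiment on the original model.

    Suppresses a group of CLT features by adding `scale` times each feature's
    (activation x decoder direction) into the layer where it writes, then
    compares next-token logits with and without the intervention.
    `model`, `clt`, and the hook points are stand-ins, not a specific library.
    """
    baseline_logits = model(prompt_tokens)

    handles = []
    for layer, feat in feature_ids:
        act = clt.feature_activation(prompt_tokens, layer, feat)  # assumed helper
        direction = clt.decoder_direction(layer, feat)            # assumed helper

        def hook(module, inputs, output, a=act, d=direction):
            # Shift the layer output against the feature's write direction
            # (shapes assumed to broadcast).
            return output + scale * a * d

        handles.append(model.layers[layer].register_forward_hook(hook))

    try:
        intervened_logits = model(prompt_tokens)
    finally:
        for h in handles:
            h.remove()

    # If the attribution graph is right, downstream features and the output
    # distribution should shift in the direction the graph predicts.
    return baseline_logits, intervened_logits
```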
