  • [Circuit Tracing Examples 1] Multi-Step Reasoning
    LLMs/Interpretability 2026. 1. 9. 14:48

    * Case study 1: using circuit analysis, show that an LLM actually performs multi-step reasoning internally.

     

    The experiment result diagrams presented in the paper look remarkably clear, but once you open the full attribution graph, it feels overwhelming. lol

    Features don't come with labels saying "I am such-and-such feature," so interpreting each feature, picking out the most meaningful ones and grouping them, and trimming an enormous number of edges and nodes down to an interpretable diagram seems like the challenging part.

     

    ※ Also, a question suddenly occurred to me: Mixture of Experts is effective because of its specialized experts, right? Then, if we ran circuit analysis on the MLP mechanism while each expert is activated, could we capture a distinct mechanism unique to each expert?


    Let’s consider the prompt Fact: the capital of the state containing Dallas is, which Claude 3.5 Haiku successfully completes with Austin. Intuitively, this completion requires two steps: first, inferring that the state containing Dallas is Texas, and second, that the capital of Texas is Austin. Does Claude actually perform these two steps internally? Or does it use some “shortcut” (e.g. perhaps it has observed a similar sentence in the training data and simply memorized the completion)?

     

    The model performs genuine two-step reasoning internally, which coexists alongside “shortcut” reasoning.


    1. Building Attribution Graph & interpretation

    We compute the attribution graph for this prompt, which describes the features the model used to produce its answer and the interactions between them.
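    As a rough intuition for what an attribution "edge" is, here is a minimal sketch in Python, assuming a toy one-layer linear replacement model in which each edge weight is approximated by (source activation) × (weight from source to target). The actual method (cross-layer transcoders, frozen attention, etc.) is far more involved; every name and number below is made up for illustration.

```python
import numpy as np

# Toy setup (purely illustrative): 5 source "features" feeding 5 downstream
# "features" through a fixed linear map W. This only sketches the bookkeeping
# of building edges between active features, not the paper's actual method.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))                         # W[j, i]: influence of source i on target j
source_acts = np.array([1.2, 0.0, 0.7, 0.0, 0.3])   # activations of the source features

def attribution_edges(source_acts, W, threshold=0.1):
    """Return edges (source i -> target j) with weight = activation_i * W[j, i],
    keeping only active sources and edges above a magnitude threshold."""
    edges = {}
    for i, a in enumerate(source_acts):
        if a == 0.0:          # inactive features contribute no edges
            continue
        for j in range(W.shape[0]):
            w = a * W[j, i]
            if abs(w) >= threshold:
                edges[(i, j)] = w
    return edges

print(attribution_edges(source_acts, W))
```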

    (1) Make diagram with nodes and edges 

    First, we examine the features’ visualizations to interpret them, and group them into categories (“supernodes”).

     

    Second, after forming these supernodes, we can see in our attribution graph interface that, for example, the “capital” supernode promotes the “say a capital” supernode, which promotes the “say Austin” supernode. To represent this, we draw a diagram where each supernode is connected to the next with a brown arrow:

     

    Third, after labeling more features and forming more supernodes, we summarize their interactions in the following diagram.
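    A sketch of the grouping step described above: given feature-level edges and a hand-made feature → supernode labeling, sum the edge weights between supernodes and keep only the strong ones. The feature IDs, labels, and weights here are hypothetical; in practice the labeling comes from manually inspecting feature visualizations.

```python
from collections import defaultdict

# Hypothetical hand-assigned labels from inspecting feature visualizations.
supernode_of = {
    "f_101": "Dallas", "f_102": "Dallas",
    "f_201": "Texas", "f_202": "Texas",
    "f_301": "say a capital",
    "f_401": "say Austin",
}

# Hypothetical feature-level attribution edges: (source, target) -> weight.
feature_edges = {
    ("f_101", "f_201"): 0.8, ("f_102", "f_202"): 0.5,
    ("f_201", "f_401"): 0.9, ("f_301", "f_401"): 0.7,
    ("f_101", "f_401"): 0.1,   # a weak "shortcut" edge
}

def aggregate(feature_edges, supernode_of, min_weight=0.2):
    """Sum feature-level edge weights between supernodes; drop weak edges."""
    agg = defaultdict(float)
    for (src, tgt), w in feature_edges.items():
        agg[(supernode_of[src], supernode_of[tgt])] += w
    return {k: w for k, w in agg.items() if abs(w) >= min_weight}

print(aggregate(feature_edges, supernode_of))
# ('Dallas', 'Texas'): 1.3, ('Texas', 'say Austin'): 0.9, ('say a capital', 'say Austin'): 0.7
# (the weak Dallas -> say Austin edge is dropped by the threshold)
```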


    (2) Interpretation

    The attribution graph contains multiple interesting paths, which we summarize below:

    • The Dallas features (with some contribution from state features) activate a group of features that represent concepts related to the state of Texas.
    • In parallel, the features activated by the word capital activate another cluster of output features that cause the model to say the name of a capital (an example of such a feature can be seen above).
    • The Texas features and the say a capital features jointly upweight the probability of the model saying Austin. They do so via two pathways:
        • directly impacting the Austin output, and
        • indirectly, by activating a cluster of say Austin output features.
    • There also exists a “shortcut” edge directly from Dallas to say Austin.

    The graph indicates that the replacement model does in fact perform “multi-hop reasoning” – that is, its decision to say Austin hinges on a chain of several intermediate computational steps (Dallas → Texas, and Texas + capital → Austin).
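    To make the "multi-hop" claim concrete, here is a small sketch that encodes the supernode graph summarized above and enumerates every path from Dallas to the Austin output, which surfaces both the two-step chain and the shortcut edge. The edge weights are made-up placeholders.

```python
import networkx as nx

# Supernode-level graph from the summary above; weights are placeholders.
G = nx.DiGraph()
G.add_edge("Dallas", "Texas", weight=1.3)
G.add_edge("state", "Texas", weight=0.4)
G.add_edge("capital", "say a capital", weight=1.1)
G.add_edge("Texas", "say Austin", weight=0.9)
G.add_edge("say a capital", "say Austin", weight=0.7)
G.add_edge("Texas", "Austin (output)", weight=0.5)
G.add_edge("say a capital", "Austin (output)", weight=0.4)
G.add_edge("say Austin", "Austin (output)", weight=1.0)
G.add_edge("Dallas", "say Austin", weight=0.2)   # the "shortcut" edge

for path in nx.all_simple_paths(G, "Dallas", "Austin (output)"):
    print(" -> ".join(path))
# Prints, in some order:
#   Dallas -> Texas -> say Austin -> Austin (output)
#   Dallas -> Texas -> Austin (output)
#   Dallas -> say Austin -> Austin (output)    (the shortcut path)
```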


    2. Validation with Inhibition Experiments

    The graphs above describe mechanisms used by our interpretable replacement model. To validate that these mechanisms are representative of the actual model, we performed intervention experiments on the feature groups above by inhibiting each of them (clamping them to a negative multiple of their original value) and measuring the impact on the activations of features in the other clusters, as well as on the model output.
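    A minimal sketch of the inhibition intervention, assuming we can read and overwrite the feature activations of a cluster (simulated here with a toy numpy vector and a toy downstream map). The clamp-to-a-negative-multiple logic mirrors the description above; the cluster indices and downstream readout are made up for illustration.

```python
import numpy as np

def inhibit(acts, cluster_idx, scale=-2.0):
    """Clamp the features in a cluster to a negative multiple of their
    original activation, leaving all other features untouched."""
    acts = acts.copy()
    acts[cluster_idx] = scale * acts[cluster_idx]
    return acts

# Toy example: 6 features; indices 0-1 play the role of the "Dallas" cluster,
# 2-3 of "Texas", 4-5 of "say a capital". W_down maps them onto two
# downstream "say Austin" features.
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=6))
W_down = np.abs(rng.normal(size=(2, 6)))

baseline = W_down @ np.maximum(acts, 0.0)
dallas_off = W_down @ np.maximum(inhibit(acts, [0, 1]), 0.0)

print("baseline say-Austin drive:", baseline)
print("after inhibiting Dallas:  ", dallas_off)   # downstream drive drops
```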

     

    The summary plot above confirms the major effects predicted by the graph. For instance, inhibiting “Dallas” features decreases the activation of “Texas” features (and features downstream of “Texas,” like “Say Austin”) but leaves “say a capital” features largely unaffected. Likewise, inhibiting “capital” features decreases the activation of “say a capital” features (and those downstream, like “say Austin”) while leaving “Texas” features largely unchanged.

     

    The effects of inhibiting features on model predictions are also semantically reasonable. For instance, inhibiting the “Dallas” cluster causes the model to output other state capitals, while inhibiting the “say a capital” cluster causes it to output non-capital completions.


    3. Swapping Alternative Features

    If the model’s completion truly is mediated by an intermediate “Texas” step, we should be able to change its output to a different state capital by replacing the model’s representation of Texas with that of another state.

     

    To identify features representing another state, we consider a related prompt, where we use “Oakland” instead of “Dallas” – Fact: the capital of the state containing Oakland is. Repeating the analysis steps above, we arrive at the following summary graph:

     

    This graph is analogous to our original graph, with “Oakland” taking the place of “Dallas,” “California” taking the place of “Texas,” and “say Sacramento” taking the place of “say Austin.”

     

    We now return to our original prompt, and swap “Texas” for “California” by inhibiting the activations of the Texas cluster and activating the California features identified from the “Oakland” prompt. In response to these perturbations, the model outputs “Sacramento” (the capital of California).
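    A sketch of the swap intervention on the same kind of toy activations: suppress the Texas-cluster features and inject the California-cluster activations recorded from the Oakland prompt, at some chosen scale. The indices, scales, and variable names are hypothetical.

```python
import numpy as np

def swap_cluster(acts, old_idx, new_idx, donor_acts,
                 inhibit_scale=-2.0, inject_scale=1.0):
    """Inhibit one feature cluster and inject another cluster's activations
    recorded from a donor prompt (e.g. Texas -> California from 'Oakland')."""
    acts = acts.copy()
    acts[old_idx] = inhibit_scale * acts[old_idx]        # suppress Texas features
    acts[new_idx] = inject_scale * donor_acts[new_idx]   # inject California features
    return acts

# Toy usage: indices 2-3 play the role of "Texas", 6-7 of "California".
rng = np.random.default_rng(1)
dallas_acts = np.abs(rng.normal(size=8))
dallas_acts[6:8] = 0.0    # no California features on the Dallas prompt
oakland_acts = np.abs(rng.normal(size=8))
oakland_acts[2:4] = 0.0   # no Texas features on the Oakland prompt

patched = swap_cluster(dallas_acts, old_idx=[2, 3], new_idx=[6, 7],
                       donor_acts=oakland_acts)
print(patched)   # Texas features negated, California features copied in
```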

     

    Similarly,

    • An analogous prompt about the state containing Savannah activates “Georgia” features. Swapping these for the “Texas” features causes the model to output “Atlanta” (the capital of Georgia).
    • An analogous prompt about the province containing Vancouver activates “British Columbia” features. Swapping these for the “Texas” features causes the model to output “Victoria” (the capital of British Columbia).
    • An analogous prompt about the country containing Shanghai activates “China” features. Swapping these for the “Texas” features causes the model to output “Beijing” (the capital of China).
    • An analogous prompt about the empire containing Thessaloniki activates “Byzantine Empire” features. Swapping these for the “Texas” features causes the model to output “Constantinople” (the capital of the ancient Byzantine Empire).

     

    Note that in some cases the magnitude of the feature injection required to change the model’s output is larger (see bottom row). Interestingly, these correspond to cases where the features being injected do not correspond to a U.S. state, suggesting that these features may “fit” less naturally into the circuit mechanisms active in the original prompt.
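    One way to quantify "how large an injection is needed": sweep the injection scale and record the smallest scale at which the patched completion overtakes the original answer. This is a hypothetical sketch of such a measurement, not the paper's procedure; score_fn stands in for running the patched model and reading the two answers' scores.

```python
import numpy as np

def min_flip_scale(score_fn, scales=np.linspace(0.0, 10.0, 101)):
    """Return the smallest injection scale at which score_fn reports that the
    injected answer beats the original answer, or None if it never flips."""
    for s in scales:
        injected_score, original_score = score_fn(s)
        if injected_score > original_score:
            return float(s)
    return None

# Toy score_fn: the injected answer's score grows linearly with the injection
# scale while the original answer's score stays fixed.
print(min_flip_scale(lambda s: (0.3 * s, 1.0)))   # -> ~3.4
```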


     
