Circuit Tracing: Revealing Computational Graphs in Language Models

LLMs/Interpretability 2026. 1. 9. 10:35

1. I found Anthropic's research topic interesting: "actually explaining the mechanism by which a model, given an input prompt, arrives at its output."

In other words: "rather than relying on the CoT the model produces, let's look at *what mechanism* actually produced the answer."

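2. I worked through the building blocks used to construct the attribution graph (a kind of causal graph) that explains that mechanism.

The foundation is the "circuit framework for computing information flow along independent paths,"

and on top of it I looked at the SAE variants, the architectures that extract the "features" serving as nodes, together with the algorithm that builds the attribution graph.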
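To make the "features as nodes, attributions as edges" idea concrete for myself, here is a minimal sketch of one edge computation. It is a toy under heavy assumptions, not the paper's implementation: there is no attention, only one token position, and the names `edge_attributions`, `decoders`, and `target_encoder` are my own. The only idea it illustrates is that, when the path between two features is linear, an edge can be scored as (source activation) × (source decoder · target encoder), and that these edges sum to the target's pre-activation.

```python
import numpy as np

def edge_attributions(source_acts, decoders, target_encoder):
    """Score the edge from each source feature to one target feature.

    source_acts:    (n_src,) activations of upstream features
    decoders:       (n_src, d_model) vectors each source writes into the residual stream
    target_encoder: (d_model,) vector the target feature reads with
    """
    # "Virtual weight": how strongly each source's write direction
    # aligns with the target's read direction.
    virtual_weights = decoders @ target_encoder
    return source_acts * virtual_weights

# Toy setup (all values are random placeholders).
rng = np.random.default_rng(0)
d_model, n_src = 8, 4
embedding = rng.normal(size=d_model)            # token embedding in the residual stream
decoders = rng.normal(size=(n_src, d_model))
source_acts = np.array([0.0, 1.5, 0.0, 0.7])    # inactive features contribute nothing
target_encoder = rng.normal(size=d_model)

# The residual stream is the embedding plus everything the sources wrote.
residual = embedding + source_acts @ decoders
target_preact = float(target_encoder @ residual)

feature_edges = edge_attributions(source_acts, decoders, target_encoder)
embedding_edge = float(target_encoder @ embedding)

# Because everything here is linear, the incoming edges account for
# the target's pre-activation exactly.
print(feature_edges, embedding_edge, target_preact)
```

In a real model the path between features also passes through attention and normalization, so this additivity only holds after those parts are treated as fixed for the given prompt; the sketch skips all of that.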

3. In this paper, Anthropic combines SAE variants and applies a few additional modifications to improve performance (measured by several criteria), building a local replacement model based on a Cross-layer Transcoder.

From this they derive an attribution graph for a specific prompt,

and use it to do what we set out to do: interpret the model's behavior.

They then show "how to derive, interpret, and validate attribution graphs" through several case studies.
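For the Cross-layer Transcoder in point 3, the part I wanted to pin down is the "cross-layer" wiring: features read from the residual stream at one layer and can write to the MLP output of that layer and every later one. The sketch below shows only that wiring. It is my own simplification: `clt_forward` and the dictionary layout of `decoders` are hypothetical names, plain ReLU stands in for whatever activation the paper uses, and the residual streams are given as inputs rather than computed from a real model.

```python
import numpy as np

def clt_forward(residuals, encoders, decoders):
    """Toy cross-layer transcoder forward pass.

    residuals: list of (d_model,) residual-stream vectors, one per layer
    encoders:  list of (n_feat, d_model) read weights, one per layer
    decoders:  dict {(src, tgt): (n_feat, d_model)} defined for src <= tgt
    Returns (per-layer feature activations, per-layer reconstructed MLP outputs).
    """
    n_layers = len(residuals)
    # Each layer's features read only from that layer's residual stream.
    acts = [np.maximum(encoders[l] @ residuals[l], 0.0) for l in range(n_layers)]
    mlp_out = []
    for tgt in range(n_layers):
        # The MLP output at layer `tgt` is rebuilt from features
        # at layer `tgt` and at every earlier layer.
        out = sum(acts[src] @ decoders[(src, tgt)] for src in range(tgt + 1))
        mlp_out.append(out)
    return acts, mlp_out

# Toy setup with random placeholder weights.
rng = np.random.default_rng(0)
n_layers, n_feat, d_model = 3, 5, 4
residuals = [rng.normal(size=d_model) for _ in range(n_layers)]
encoders = [rng.normal(size=(n_feat, d_model)) for _ in range(n_layers)]
decoders = {(s, t): rng.normal(size=(n_feat, d_model))
            for s in range(n_layers) for t in range(s, n_layers)}

acts, mlp_out = clt_forward(residuals, encoders, decoders)
print([o.shape for o in mlp_out])
```

The point of the layout is that one feature can carry its effect across many layers directly, instead of being re-derived layer by layer, which is what keeps the resulting graphs shorter.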

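But my impression is.. haha
this is not that simple.

While stacking up the building blocks toward our goal, nothing felt particularly 'ambiguous'. (Though interpreting features is not easy, of course.)

But.. attribution graphs are a bit different.. lol

Interpreting and validating them is not easy. They don't come out that clean.

(This is purely my personal opinion, but it feels like, if you're not careful, the interpretation could end up being whatever you want it to be.

I'd really like to ask other people this: am I the only one who finds attribution graphs hard to interpret...?)

Of course, that is exactly why this is a research topic, and why the research needs to keep advancing.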

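It has only been a few days (?) since I first encountered this topic, and unless I'm a genius, it would be stranger to look at it once and be 'done understanding'.. I think I need some time to understand it properly.

The Anthropic team has been developing this line of research for several years, so expecting to understand it at first sight is probably the arrogant part, right...?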

And another personal thought: mechanistic interpretability is supposed to be about faithfulness, right.
But then it seems important to evaluate whether the interpretation itself is really meaningful and reasonable..

When I went through the case studies and saw the actual use cases.. it's not as simple as the high-level idea..

(Even the arithmetic operations are like that.)

I'm curious what others think..!!

https://transformer-circuits.pub/2025/attribution-graphs/methods.html

