Assessing skeptical views of interpretability research

LLMs/Interpretability 2026. 1. 9. 11:50

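To make constructive progress, we have to accept critical perspectives.

It is worth asking whether this line of research is justified.

I think I first came across an interpretability lecture in the Stanford NLP class the year before last. I wasn't particularly interested and moved on.

What got me interested in interpretability this time was pure chance.

"Golden Gate experiment!"

Is that too silly a reason to get interested? Haha

Anyway, Anthropic's Golden Gate bait (?) worked.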

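While taking in the Transformer circuit framework, I spent a while wondering 'is this computation actually right?',

but once I accepted it, it was rather fun.

It's the first time since causal inference that I've studied for 10 hours at a stretch.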

The problem is..

I strongly agreed with the high-level idea,

but starting with features, I began to waver. Haha

They struck me as very.. hard to interpret.

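I had a feeling there would be skeptical views.

Sure enough, there are..!

And the post linked below offers a very detailed, case-based defense against that skepticism.

It also lays out directions for the future.

(It scratches exactly where it itches!!)

(As an aside, there was a time when I wondered 'why has GPT-4o gotten so smarmy?', and it turns out there was a sycophancy issue..!)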

https://web.stanford.edu/~cgpotts/blog/interp/

Assessing skeptical views of interpretability research | Christopher Potts

https://www.youtube.com/watch?v=woo_J0RKcpQ&t=1910s
