A look at what is going on inside an LLM.
By modifying an LLM's activations, you can steer the model's behavior.
I got hooked by Golden Gate..
The Golden Gate experiment cracked me up..
Steering an LLM's behavior by manipulating its intrinsic features struck me as fascinating,
so I read some related articles..
The articles below are gentle as intros,
but Anthropic's original papers are no joke.
I'm reading the original papers now; once I see whether I can actually understand them.. I'll write a post about them.. haha
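The steering idea above can be sketched in a toy way: find a direction in activation space that corresponds to a feature, then add a scaled copy of that direction to the hidden activations during the forward pass. The snippet below is only a minimal NumPy sketch under made-up assumptions, not Anthropic's actual method; `W_in`, `W_out`, and `golden_gate_dir` are hypothetical stand-ins (in the real work, feature directions come from a sparse autoencoder trained on Claude's activations).

```python
import numpy as np

# Toy sketch of activation steering (all names here are hypothetical).
# A one-hidden-layer "model": x -> h = tanh(W_in @ x) -> y = W_out @ h.
rng = np.random.default_rng(0)
d_model = 8

W_in = rng.normal(size=(d_model, d_model))
W_out = rng.normal(size=(d_model, d_model))

# Pretend a dictionary-learning method told us hidden unit direction #3
# is the "Golden Gate Bridge" feature.
golden_gate_dir = np.zeros(d_model)
golden_gate_dir[3] = 1.0

def forward(x, steer_strength=0.0):
    h = np.tanh(W_in @ x)                      # hidden activations
    h = h + steer_strength * golden_gate_dir   # boost ("turn up") the feature
    return W_out @ h

x = rng.normal(size=d_model)
base = forward(x)                  # normal behavior
steered = forward(x, steer_strength=5.0)  # behavior pulled toward the feature
```

In a real transformer you would do the same thing with a forward hook on a residual-stream layer; the point is just that the intervention is an addition to activations at inference time, with no weight updates.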
(2024.05.23) Golden Gate Claude
https://www.anthropic.com/news/golden-gate-claude
When we turn up the strength of the “Golden Gate Bridge” feature, Claude’s responses begin to focus on the Golden Gate Bridge. For a short time, we’re making this model available for everyone to interact with.
(2024.05.21) Mapping the Mind of a Large Language Model
https://www.anthropic.com/news/mapping-mind-language-model
We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model.
(2023.10.05) Decomposing Language Models Into Understandable Components
https://www.anthropic.com/news/decomposing-language-models-into-understandable-components