A look at what is going on inside an LLM.
By modifying an LLM's activations, you can steer the model's behavior.
I got hooked by Golden Gate..
The Golden Gate experiment cracked me up..
Steering an LLM's behavior by manipulating its intrinsic features struck me as fascinating,
so I read some related articles..
The articles below are gentle as intros,
but Anthropic's original papers are no joke.
I'm reading the original papers now; once I see whether I can actually understand them.. I'll write a post about them.. haha
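The steering idea above can be sketched in a toy way: find a direction in activation space that corresponds to a feature, then add a scaled copy of that direction to the hidden activations during the forward pass. The snippet below is only a minimal NumPy sketch under made-up assumptions, not Anthropic's actual method; `W_in`, `W_out`, and `golden_gate_dir` are hypothetical stand-ins (in the real work, feature directions come from a sparse autoencoder trained on Claude's activations).

```python
import numpy as np

# Toy sketch of activation steering (all names here are hypothetical).
# A one-hidden-layer "model": x -> h = tanh(W_in @ x) -> y = W_out @ h.
rng = np.random.default_rng(0)
d_model = 8

W_in = rng.normal(size=(d_model, d_model))
W_out = rng.normal(size=(d_model, d_model))

# Pretend a dictionary-learning method told us hidden unit direction #3
# is the "Golden Gate Bridge" feature.
golden_gate_dir = np.zeros(d_model)
golden_gate_dir[3] = 1.0

def forward(x, steer_strength=0.0):
    h = np.tanh(W_in @ x)                      # hidden activations
    h = h + steer_strength * golden_gate_dir   # boost ("turn up") the feature
    return W_out @ h

x = rng.normal(size=d_model)
base = forward(x)                  # normal behavior
steered = forward(x, steer_strength=5.0)  # behavior pulled toward the feature
```

In a real transformer you would do the same thing with a forward hook on a residual-stream layer; the point is just that the intervention is an addition to activations at inference time, with no weight updates.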
(2024.05.23) Golden Gate Claude
https://www.anthropic.com/news/golden-gate-claude
When we turn up the strength of the “Golden Gate Bridge” feature, Claude’s responses begin to focus on the Golden Gate Bridge. For a short time, we’re making this model available for everyone to interact with.
(2024.05.21) Mapping the Mind of a Large Language Model
https://www.anthropic.com/news/mapping-mind-language-model
We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model.
(2023.10.05) Decomposing Language Models Into Understandable Components
https://www.anthropic.com/news/decomposing-language-models-into-understandable-components