Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
LLMs/Interpretability, 2026. 1. 6. 23:19
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Authors: Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Ta
Feature extraction with a Sparse Autoencoder!
Colab Demo!
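
To make the sparse-autoencoder idea concrete, here is a minimal sketch in plain Python. It assumes a one-hidden-layer autoencoder with a ReLU encoder, a linear decoder, and an L1 sparsity penalty on the feature activations. All names (`sae_forward`, `sae_loss`), the toy sizes, the random weights, and the penalty coefficient are hypothetical and chosen only for illustration; this is not the code from the paper or the Colab demo.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector into sparse features, then reconstruct it.

    W_enc: d_features rows, each of length d_model.
    W_dec: d_features rows, each a dictionary direction of length d_model.
    """
    # Assumption: the decoder bias is subtracted before encoding.
    x_centered = [a - b for a, b in zip(x, b_dec)]
    pre = matvec(W_enc, x_centered)
    # ReLU keeps feature activations non-negative.
    features = [max(0.0, p + b) for p, b in zip(pre, b_enc)]
    # Reconstruction = bias + sum of (activation * dictionary direction).
    x_hat = list(b_dec)
    for f, direction in zip(features, W_dec):
        for i, d in enumerate(direction):
            x_hat[i] += f * d
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# Toy setup with hypothetical sizes: more features than input dimensions.
rng = random.Random(0)
d_model, d_features = 4, 16
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_features)]
W_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_features)]
b_enc = [0.0] * d_features
b_dec = [0.0] * d_model

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat, l1_coeff=1e-3)
```

With untrained random weights the reconstruction is poor and the features mean nothing. The features only become candidates for interpretation after the loss is minimized over many activation vectors collected from the model.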