LLMs/Interpretability
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
밤 편지
2026. 1. 6. 23:19
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Authors Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Ta
Feature extraction with a Sparse Autoencoder!
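To make the idea concrete, here is a minimal sketch of a one-hidden-layer sparse autoencoder forward pass and loss in NumPy. It is an illustration, not the authors' code: the sizes (`d_model`, `d_features`), the random stand-in activations, the `l1_coeff` value, and the function names `sae_forward` and `sae_loss` are all assumptions made for this example. The general shape (ReLU encoder, linear decoder, reconstruction error plus an L1 penalty on the feature activations) follows the dictionary-learning setup as I understand it from the paper.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into sparse features, then reconstruct them."""
    x_centered = x - b_dec                           # subtract decoder bias before encoding
    f = np.maximum(0.0, x_centered @ W_enc + b_enc)  # ReLU feature activations
    x_hat = f @ W_dec + b_dec                        # linear reconstruction
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty on features."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * l1

# Hypothetical sizes: the feature dictionary is wider than the activation space.
rng = np.random.default_rng(0)
d_model, d_features, batch = 8, 32, 4

x = rng.normal(size=(batch, d_model))                # stand-in for MLP activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary directions
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)          # l1_coeff is an arbitrary choice
print(f.shape, x_hat.shape, float(loss))
```

Each column of `f` is one candidate feature, and each row of `W_dec` is the direction that feature writes back into activation space. Training (not shown) would minimize `sae_loss` over many activation samples.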
Colab Demo!