Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

밤 편지 2026. 1. 6. 23:19

https://transformer-circuits.pub/2023/monosemantic-features/index.html

Authors Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Ta

Feature extraction with a sparse autoencoder!
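
To make the idea concrete, here is a minimal sketch of a sparse autoencoder forward pass in NumPy. It is not the paper's code: the function name `sae_forward`, the dimensions, the random stand-in activations, and the `l1_coeff` value are all assumptions chosen for illustration. The structure (subtract the decoder bias, apply a ReLU encoder, reconstruct linearly, penalize the L1 norm of the feature activations) follows the general recipe as I understand it, with no training loop included.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One forward pass of a sparse autoencoder over a batch of activations.

    x:     (batch, d_model) activations to decompose
    W_enc: (d_model, d_hidden), W_dec: (d_hidden, d_model)
    l1_coeff is a hypothetical sparsity weight, not a value from the paper.
    """
    x_centered = x - b_dec                           # remove the decoder bias before encoding
    f = np.maximum(x_centered @ W_enc + b_enc, 0.0)  # ReLU gives non-negative feature activations
    x_hat = f @ W_dec + b_dec                        # linear reconstruction from dictionary rows
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))  # squared reconstruction error
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))      # L1 penalty encourages few active features
    return f, x_hat, recon + l1_coeff * sparsity


# Toy setup: random data standing in for MLP activations (assumption).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                            # overcomplete: more features than dimensions
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary directions
b_enc = np.zeros(d_hidden)
b_dec = np.zeros(d_model)

features, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(features.shape, x_hat.shape, float(loss))
```

Each row of `W_dec` plays the role of one candidate feature direction, and `features` holds how strongly each direction fires on each input. With untrained random weights the activations are not yet sparse or meaningful; that only emerges after optimizing the loss on real model activations.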

https://www.alignmentforum.org/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s

Colab Demo!