Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

LLMs/Interpretability 2026. 1. 6. 23:19

https://transformer-circuits.pub/2023/monosemantic-features/index.html

Authors Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Ta


Feature extraction with a Sparse Autoencoder!
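To make the idea concrete, here is a minimal NumPy sketch of a sparse autoencoder forward pass: a ReLU encoder into an overcomplete hidden layer, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The function names (`init_sae`, `sae_forward`), the sizes, the initialization scale, and the `l1_coeff` value are all illustrative assumptions, not the paper's actual code or hyperparameters, and no training loop is included.

```python
import numpy as np

def init_sae(d_model, d_hidden, seed=0):
    """Create illustrative SAE parameters (hypothetical helper, not from the paper)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.1, size=(d_model, d_hidden))
    W_dec = rng.normal(0.0, 0.1, size=(d_hidden, d_model))
    # Normalize each dictionary direction (one row of W_dec) to unit length.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_enc,
        "b_enc": np.zeros(d_hidden),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x, l1_coeff=1e-3):
    """Return (features, reconstruction, loss) for a batch x of shape (batch, d_model)."""
    # Encoder: subtract the decoder bias, project up, apply ReLU.
    f = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    # Decoder: linear combination of dictionary directions plus bias.
    x_hat = f @ params["W_dec"] + params["b_dec"]
    # Loss: squared reconstruction error plus an L1 penalty on feature activations.
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(f), axis=1))
    return f, x_hat, mse + l1_coeff * l1

# Stand-in for MLP activations; real inputs would come from a transformer layer.
x = np.random.default_rng(1).normal(size=(8, 16))
params = init_sae(d_model=16, d_hidden=64)
features, x_hat, loss = sae_forward(params, x)
```

Each column of `features` plays the role of one candidate feature. After training, the L1 term pushes most entries to zero for any given input, which is what makes individual features easier to inspect.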

https://www.alignmentforum.org/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s

Colab Demo!

