[Crosscoders] Sparse Crosscoders for Cross-Layer Features and Model Diffing
LLMs/Interpretability 2026. 1. 8. 16:18
https://transformer-circuits.pub/2024/crosscoders/index.html
Sparse Crosscoders for Cross-Layer Features and Model Diffing
Authors Jack Lindsey*, Adly Templeton*, Jonathan Marcus*, Thomas Conerly*, Joshua Batson, Christopher Olah
Where autoencoders encode and predict activations at a single layer, and transcoders use activations from one layer to predict the next, a crosscoder reads and writes to multiple layers. Crosscoders produce shared features across layers and even models. They have several applications:
1. Cross-Layer Features - Crosscoders allow us to think of features as being spread across layers, resolving cross-layer superposition and tracking persistent features through the residual stream.
2. Circuit Simplification - By tracking features that continue to exist in the residual stream, crosscoders can remove "duplicate features" from analysis and allow features to "jump" across many uninteresting identity circuit connections, and generally simplify circuits.
3. Model Diffing - Crosscoders can produce shared sets of features across models. This includes one model across training or finetuning, and also completely independent models with different architectures.
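
To make "reads and writes to multiple layers" concrete, here is a minimal NumPy sketch of a crosscoder forward pass and loss. It is an illustration under assumptions, not the authors' implementation: the sizes, the random initialization, the names (`W_enc`, `W_dec`, `encode`, `decode`, `loss`, `l1_coeff`), and the exact form of the sparsity penalty (feature activation weighted by the summed per-layer decoder norms) are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_layers, d_model, n_features = 3, 8, 16

# One encoder and one decoder matrix per layer; the feature vector is shared.
W_enc = rng.normal(scale=0.1, size=(n_layers, d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_layers, n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros((n_layers, d_model))


def encode(acts):
    """acts: (batch, n_layers, d_model) -> shared features (batch, n_features).

    Each layer is projected by its own encoder and the results are summed,
    so a single feature can read from several layers at once.
    """
    pre = np.einsum("bld,ldf->bf", acts, W_enc) + b_enc
    return np.maximum(pre, 0.0)  # ReLU


def decode(feats):
    """feats: (batch, n_features) -> reconstructions (batch, n_layers, d_model).

    The same feature vector is written back to every layer through that
    layer's own decoder.
    """
    return np.einsum("bf,lfd->bld", feats, W_dec) + b_dec


def loss(acts, l1_coeff=1e-3):
    """Reconstruction error summed over layers plus a sparsity penalty.

    Assumption: the penalty weights each feature's activation by the sum
    of its per-layer decoder norms.
    """
    feats = encode(acts)
    recon = decode(feats)
    mse = np.sum((acts - recon) ** 2, axis=(1, 2)).mean()
    dec_norms = np.linalg.norm(W_dec, axis=2).sum(axis=0)  # (n_features,)
    sparsity = (feats * dec_norms).sum(axis=1).mean()
    return mse + l1_coeff * sparsity


# Usage: random stand-in activations for a batch of 4 positions.
acts = rng.normal(size=(4, n_layers, d_model))
print(encode(acts).shape, decode(encode(acts)).shape, loss(acts))
```

For model diffing, the same sketch would apply with the leading axis indexing models (or checkpoints) rather than layers. Models with different widths would need separately shaped per-model matrices instead of a single stacked array.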




