arXiv:2408.00113
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
31 July 2024
Adam Karvonen
Benjamin Wright
Can Rager
Rico Angell
Jannik Brinkmann
Logan Smith
C. M. Verdun
David Bau
Samuel Marks
Papers citing
"Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models"
26 / 26 papers shown
Sparsification and Reconstruction from the Perspective of Representation Geometry
Wenjie Sun
Bingzhe Wu
Zhile Yang
Chengke Wu
79
0
0
28 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask
Neel Nanda
Noura Al Moubayed
94
1
0
23 May 2025
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Zihao Li
Xu Wang
Yuzhe Yang
Ziyu Yao
Haoyi Xiong
Jundong Li
LLMSV
LRM
129
3
0
21 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
165
0
0
21 May 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Bart Bussmann
Noa Nabeshima
Adam Karvonen
Neel Nanda
129
13
0
21 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
90
0
0
21 Mar 2025
Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms
Xiaojian Li
Yongkang Leng
Ruiqing Ding
Hangjie Mo
Shanlin Yang
LRM
80
1
0
15 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Anton van den Hengel
Javen Qinfeng Shi
465
1
0
12 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
181
23
0
12 Mar 2025
Mixture of Experts Made Intrinsically Interpretable
Xingyi Yang
Constantin Venhoff
Ashkan Khakzar
Christian Schroeder de Witt
P. Dokania
Adel Bibi
Philip Torr
MoE
125
1
0
05 Mar 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
149
17
0
23 Feb 2025
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Thomas Fel
Ekdeep Singh Lubana
Jacob S. Prince
M. Kowal
Victor Boutin
Isabel Papadimitriou
Binxu Wang
Martin Wattenberg
Demba Ba
Talia Konkle
81
8
0
18 Feb 2025
The Complexity of Learning Sparse Superposed Features with Feedback
Akash Kumar
482
0
0
08 Feb 2025
Can Input Attributions Explain Inductive Reasoning in In-Context Learning?
Mengyu Ye
Tatsuki Kuribayashi
Goro Kobayashi
Jun Suzuki
LRM
172
0
0
20 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
442
1
0
16 Dec 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
121
11
0
07 Nov 2024
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders
Kola Ayonrinde
107
5
0
04 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
Federico Belotti
Marco Molinari
82
6
0
28 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
109
16
0
18 Oct 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
Constantin Venhoff
Anisoara Calinescu
Philip Torr
Christian Schroeder de Witt
74
0
0
09 Oct 2024
Mechanistic?
Naomi Saphra
Sarah Wiegreffe
AI4CE
80
13
0
07 Oct 2024
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin
James Wilken-Smith
Tomáš Dulka
Hardik Bhatnagar
Joseph Bloom
130
37
0
22 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
124
128
0
09 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
132
25
0
02 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
145
39
0
22 Jul 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
182
159
0
28 Mar 2024