arXiv:2405.12241
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
17 May 2024
Dan Braun
Jordan K. Taylor
Nicholas Goldowsky-Dill
Lee D. Sharkey
Papers citing
"Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning"
12 / 12 papers shown
Title
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
68
0
0
21 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
95
1
0
13 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Anton van den Hengel
Javen Qinfeng Shi
386
0
0
12 Mar 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
71
0
0
03 Feb 2025
Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon
Manish Shrivastava
David M. Krueger
Ekdeep Singh Lubana
81
8
0
15 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
67
5
0
06 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
142
32
0
02 Jul 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park
Yo Joong Choe
Yibo Jiang
Victor Veitch
74
37
0
03 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
116
148
0
28 Mar 2024
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Alex Tamkin
Mohammad Taufeeque
Noah D. Goodman
65
29
0
26 Oct 2023
Axiomatic Attribution for Deep Networks
Mukund Sundararajan
Ankur Taly
Qiqi Yan
OOD
FAtt
177
5,986
0
04 Mar 2017
Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Sanjeev Arora
Yuanzhi Li
Yingyu Liang
Tengyu Ma
Andrej Risteski
75
282
0
14 Jan 2016