Identifying Functionally Important Features with End-to-End Sparse
Dictionary Learning

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024

Jordan K. Taylor

Nicholas Goldowsky-Dill

Papers citing "Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning"

12 / 12 papers shown

Title
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 68 0 0 21 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel DÓosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 95 1 0 13 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Zhen Zhang Biwei Huang Anton van den Hengel Javen Qinfeng Shi Javen Qinfeng Shi 386 0 0 12 Mar 2025
Discovering Chunks in Neural Embeddings for Interpretability Shuchen Wu Stephan Alaniz Eric Schulz Zeynep Akata 71 0 0 03 Feb 2025
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 81 8 0 15 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 67 5 0 06 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 142 32 0 02 Jul 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models Kiho Park Yo Joong Choe Yibo Jiang Victor Veitch 74 37 0 03 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 116 148 0 28 Mar 2024
Codebook Features: Sparse and Discrete Interpretability for Neural Networks Alex Tamkin Mohammad Taufeeque Noah D. Goodman 65 29 0 26 Oct 2023
Axiomatic Attribution for Deep Networks Mukund Sundararajan Ankur Taly Qiqi Yan OOD FAtt 177 5,986 0 04 Mar 2017
Linear Algebraic Structure of Word Senses, with Applications to Polysemy Sanjeev Arora Yuanzhi Li Yingyu Liang Tengyu Ma Andrej Risteski 75 282 0 14 Jan 2016