Improving Dictionary Learning with Gated Sparse Autoencoders

Improving Dictionary Learning with Gated Sparse Autoencoders

24 April 2024

Senthooran Rajamanoharan

Papers citing "Improving Dictionary Learning with Gated Sparse Autoencoders"

19 / 19 papers shown

Title
Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning Xinyue Wang Biwei Huang OffRL CML 29 0 0 13 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 70 0 0 29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video Sonia Joseph Praneet Suresh Lorenz Hufe Edward Stevinson Robert Graham Yash Vadi Danilo Bzdok Sebastian Lapuschkin Lee Sharkey Blake A. Richards 72 0 0 28 Apr 2025
Towards Combinatorial Interpretability of Neural Computation Micah Adler Dan Alistarh Nir Shavit FAtt 110 1 0 10 Apr 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 34 0 0 21 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Biwei Huang Mingming Gong Anton van den Hengel Javen Qinfeng Shi J. Shi 154 0 0 12 Mar 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons Yuheng Chen Pengfei Cao Kang Liu Jun Zhao 47 0 0 18 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 49 1 0 09 Jan 2025
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models Konstantin Donhauser Kristina Ulicna Gemma Elyse Moran Aditya Ravuri Kian Kenyon-Dean Cian Eastwood Jason Hartford 76 0 0 20 Dec 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders Charles OÑeill David Klindt David Klindt 93 1 0 20 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 40 5 0 07 Nov 2024
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 42 7 0 15 Oct 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 82 19 0 02 Jul 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Jack Merullo Carsten Eickhoff Ellie Pavlick 58 13 0 13 Jun 2024
Interpreting the Second-Order Effects of Neurons in CLIP Yossi Gandelsman Alexei A. Efros Jacob Steinhardt MILM 56 16 0 06 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 46 111 0 28 Mar 2024
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 160 186 0 02 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 494 0 01 Nov 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 125 317 0 21 Sep 2022