Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

23 February 2025

Subhash Kantamneni

Joshua Engels

Senthooran Rajamanoharan

Max Tegmark

Neel Nanda

Papers citing "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing"

Title
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders James Oldfield Shawn Im Yixuan Li M. Nicolaou Ioannis Patras Grigorios G. Chrysos MoE 43 0 0 27 May 2025
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms Mengru Wang Ziwen Xu Shengyu Mao Shumin Deng Zhaopeng Tu Ningyu Zhang LLMSV 73 0 0 23 May 2025
TRACE for Tracking the Emergence of Semantic Representations in Transformers Nura Aljaafari Danilo S. Carvalho André Freitas 69 0 0 23 May 2025
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models Zirui He Mingyu Jin Bo Shen Ali Payani Yongfeng Zhang Mengnan Du LLMSV 59 0 0 22 May 2025
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders David Chanin Tomáš Dulka Adrià Garriga-Alonso 49 0 0 16 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection? Rui Melo Claudia Mamede Andre Catarino Rui Abreu Henrique Lopes Cardoso 69 0 0 15 May 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring Henk Tillman Dan Mossing LLMSV 74 0 0 28 Apr 2025
Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis Alexander Ku Declan Campbell Xuechunzi Bai Jiayi Geng Ryan Liu ... Ilia Sucholutsky Veniamin Veselovsky Liyi Zhang Jian-Qiao Zhu Thomas L. Griffiths ELM 126 4 0 17 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Biwei Huang Anton van den Hengel Javen Qinfeng Shi 381 0 0 12 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Sai Sumedh R. Hindupur Ekdeep Singh Lubana Thomas Fel Demba Ba 83 9 0 03 Mar 2025
Sparse Autoencoders Can Interpret Randomly Initialized Transformers Thomas Heap Tim Lawson Lucy Farnik Laurence Aitchison 49 16 0 29 Jan 2025
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 86 14 0 18 Oct 2024
Efficient Dictionary Learning with Switch Sparse Autoencoders Anish Mudide Joshua Engels Eric J. Michaud Max Tegmark Christian Schroeder de Witt 52 13 0 10 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 84 16 0 10 Oct 2024
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders David Chanin James Wilken-Smith Tomáš Dulka Hardik Bhatnagar Joseph Bloom 67 33 0 22 Sep 2024
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning Dan Braun Jordan K. Taylor Nicholas Goldowsky-Dill Lee D. Sharkey 50 39 0 17 May 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 80 145 0 22 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 103 145 0 28 Mar 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 420 2,081 0 31 Dec 2020
Aligning AI With Shared Human Values Dan Hendrycks Collin Burns Steven Basart Andrew Critch Jingkai Li D. Song Jacob Steinhardt 127 548 0 05 Aug 2020
"Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding Ben Zhou Daniel Khashabi Qiang Ning Dan Roth AIMat 77 196 0 06 Sep 2019
Neural Network Acceptability Judgments Alex Warstadt Amanpreet Singh Samuel R. Bowman 203 1,406 0 31 May 2018
XGBoost: A Scalable Tree Boosting System Tianqi Chen Carlos Guestrin 556 38,735 0 09 Mar 2016
Linear Algebraic Structure of Word Senses, with Applications to Polysemy Sanjeev Arora Yuanzhi Li Yingyu Liang Tengyu Ma Andrej Risteski 73 282 0 14 Jan 2016