Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

28 November 2024

Papers citing "Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks"

Title
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations. Aaron Jiaxun Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju. AAML. 157 0 0. 21 May 2025
Ensembling Sparse Autoencoders. Soham Gadgil, Chris Lin, Su-In Lee. 87 0 0. 21 May 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders. Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda. 129 13 0. 21 Mar 2025
Discovering Chunks in Neural Embeddings for Interpretability. Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata. 109 0 0. 03 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words. Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Y. Matsuo. 126 3 0. 09 Jan 2025