Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

9 January 2025

Papers citing "Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words"

30 / 30 papers shown

Title
On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond Jingyi Cui Qi Zhang Yifei Wang Yisen Wang 12 0 0 19 Jun 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations Aaron Jiaxun Li Suraj Srinivas Usha Bhalla Himabindu Lakkaraju AAML 157 0 0 21 May 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality Sewoong Lee Adam Davies Marc E. Canby Julia Hockenmaier LLMSV 116 0 0 31 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen Can Rager Johnny Lin Curt Tigges Joseph Isaac Bloom ... Matthew Wearden Arthur Conmy Samuel Marks Neel Nanda MU 166 23 0 12 Mar 2025
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks Adam Karvonen Can Rager Samuel Marks Neel Nanda 89 6 0 28 Nov 2024
RedPajama: an Open Dataset for Training Large Language Models Maurice Weber Daniel Y. Fu Quentin Anthony Yonatan Oren S. Adams ... Tri Dao Percy Liang Christopher Ré Irina Rish Ce Zhang 247 87 0 19 Nov 2024
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models Viacheslav Surkov Chris Wendler Antonio Mari Mikhail Terekhov Justin Deschenaux Robert West Çağlar Gülçehre David Bau VLM 128 14 0 28 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 112 21 0 10 Oct 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small Maheep Chaudhary Atticus Geiger 87 19 0 05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 Tom Lieberum Senthooran Rajamanoharan Arthur Conmy Lewis Smith Nicolas Sonnerat Vikrant Varma János Kramár Anca Dragan Rohin Shah Neel Nanda 121 128 0 09 Aug 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders Senthooran Rajamanoharan Tom Lieberum Nicolas Sonnerat Arthur Conmy Vikrant Varma János Kramár Neel Nanda 87 105 0 19 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders Connor Kissane Robert Krzyzanowski Joseph Isaac Bloom Arthur Conmy Neel Nanda MILM 87 24 0 25 Jun 2024
Scaling and evaluating sparse autoencoders Leo Gao Tom Dupré la Tour Henk Tillman Gabriel Goh Rajan Troll Alec Radford Ilya Sutskever Jan Leike Jeffrey Wu 100 163 0 06 Jun 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov Georg Lange Neel Nanda 77 41 0 14 May 2024
Improving Dictionary Learning with Gated Sparse Autoencoders Senthooran Rajamanoharan Arthur Conmy Lewis Smith Tom Lieberum Vikrant Varma János Kramár Rohin Shah Neel Nanda RALM 82 94 0 24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 137 158 0 22 Apr 2024
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models Haz Sameen Shahgir Khondker Salman Sayeed Abhik Bhattacharjee Wasi Uddin Ahmad Yue Dong Rifat Shahriyar VLM MLLM 99 14 0 23 Mar 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT Zhengfu He Xuyang Ge Qiong Tang Tianxiang Sun Qinyuan Cheng Xipeng Qiu 94 22 0 19 Feb 2024
Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks Gouki Minegishi Yusuke Iwasawa Yutaka Matsuo 65 3 0 30 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham Aidan Ewart Logan Riggs R. Huben Lee Sharkey MILM 141 449 0 15 Sep 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla Tom Lieberum Matthew Rahtz János Kramár Neel Nanda G. Irving Rohin Shah Vladimir Mikulik 103 115 0 18 Jul 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 282 218 0 02 May 2023
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling Stella Biderman Hailey Schoelkopf Quentin G. Anthony Herbie Bradley Kyle O'Brien ... USVSN Sai Prashanth Edward Raff Aviya Skowron Lintang Sutawika Oskar van der Wal 156 1,311 0 03 Apr 2023
Progress measures for grokking via mechanistic interpretability Neel Nanda Lawrence Chan Tom Lieberum Jess Smith Jacob Steinhardt 115 451 0 12 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 320 563 0 01 Nov 2022
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets Alethea Power Yuri Burda Harrison Edwards Igor Babuschkin Vedant Misra 115 366 0 06 Jan 2022
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization Alessandro Raganato Tommaso Pasini Jose Camacho-Collados Mohammad Taher Pilehvar 90 65 0 13 Oct 2020
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems Alex Wang Yada Pruksachatkun Nikita Nangia Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 428 2,331 0 02 May 2019
JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks N. Benjamin Erichson Z. Yao Michael W. Mahoney AAML 69 24 0 07 Apr 2019
WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations Mohammad Taher Pilehvar Jose Camacho-Collados 227 493 0 28 Aug 2018