Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2501.06254
Cited By
v1
v2 (latest)
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
9 January 2025
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words"
30 / 30 papers shown
Title
On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond
Jingyi Cui
Qi Zhang
Yifei Wang
Yisen Wang
12
0
0
19 Jun 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
157
0
0
21 May 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Sewoong Lee
Adam Davies
Marc E. Canby
Julia Hockenmaier
LLMSV
116
0
0
31 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Matthew Wearden
Arthur Conmy
Arthur Conmy
Samuel Marks
Neel Nanda
MU
166
23
0
12 Mar 2025
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
Adam Karvonen
Can Rager
Samuel Marks
Neel Nanda
89
6
0
28 Nov 2024
RedPajama: an Open Dataset for Training Large Language Models
Maurice Weber
Daniel Y. Fu
Quentin Anthony
Yonatan Oren
S. Adams
...
Tri Dao
Percy Liang
Christopher Ré
Irina Rish
Ce Zhang
247
87
0
19 Nov 2024
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models
Viacheslav Surkov
Chris Wendler
Antonio Mari
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
David Bau
VLM
128
14
0
28 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
112
21
0
10 Oct 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary
Atticus Geiger
87
19
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
121
128
0
09 Aug 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
87
105
0
19 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
87
24
0
25 Jun 2024
Scaling and evaluating sparse autoencoders
Leo Gao
Tom Dupré la Tour
Henk Tillman
Gabriel Goh
Rajan Troll
Alec Radford
Ilya Sutskever
Jan Leike
Jeffrey Wu
100
163
0
06 Jun 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov
Georg Lange
Neel Nanda
77
41
0
14 May 2024
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Tom Lieberum
Vikrant Varma
János Kramár
Rohin Shah
Neel Nanda
RALM
82
94
0
24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
137
158
0
22 Apr 2024
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models
Haz Sameen Shahgir
Khondker Salman Sayeed
Abhik Bhattacharjee
Wasi Uddin Ahmad
Yue Dong
Rifat Shahriyar
VLM
MLLM
99
14
0
23 Mar 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He
Xuyang Ge
Qiong Tang
Tianxiang Sun
Qinyuan Cheng
Xipeng Qiu
94
22
0
19 Feb 2024
Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks
Gouki Minegishi
Yusuke Iwasawa
Yutaka Matsuo
65
3
0
30 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
141
449
0
15 Sep 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Tom Lieberum
Matthew Rahtz
János Kramár
Neel Nanda
G. Irving
Rohin Shah
Vladimir Mikulik
103
115
0
18 Jul 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
282
218
0
02 May 2023
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Stella Biderman
Hailey Schoelkopf
Quentin G. Anthony
Herbie Bradley
Kyle O'Brien
...
USVSN Sai Prashanth
Edward Raff
Aviya Skowron
Lintang Sutawika
Oskar van der Wal
156
1,311
0
03 Apr 2023
Progress measures for grokking via mechanistic interpretability
Neel Nanda
Lawrence Chan
Tom Lieberum
Jess Smith
Jacob Steinhardt
115
451
0
12 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
320
563
0
01 Nov 2022
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power
Yuri Burda
Harrison Edwards
Igor Babuschkin
Vedant Misra
115
366
0
06 Jan 2022
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
Alessandro Raganato
Tommaso Pasini
Jose Camacho-Collados
Mohammad Taher Pilehvar
90
65
0
13 Oct 2020
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Alex Jinpeng Wang
Yada Pruksachatkun
Nikita Nangia
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
428
2,331
0
02 May 2019
JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks
N. Benjamin Erichson
Z. Yao
Michael W. Mahoney
AAML
69
24
0
07 Apr 2019
WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations
Mohammad Taher Pilehvar
Jose Camacho-Collados
227
493
0
28 Aug 2018
1