
Interpreting Attention Layer Outputs with Sparse Autoencoders
arXiv:2406.17759 · 25 June 2024
Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Papers citing "Interpreting Attention Layer Outputs with Sparse Autoencoders"

12 papers shown

1. FineScope: Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation (01 May 2025)
   Chaitali Bhattacharyya, Yeseong Kim

2. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition (29 Apr 2025)
   Zhengfu He, J. Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junzhe Zhang, Xipeng Qiu

3. Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models (12 Mar 2025)
   Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Kui Ren, Ruoxi Jia, Jiaheng Zhang

4. SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders (29 Jan 2025)
   Bartosz Cywiński, Kamil Deja

5. Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words (09 Jan 2025)
   Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Y. Matsuo

6. Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders (20 Nov 2024)
   Charles O'Neill, David Klindt

7. Residual Stream Analysis with Multi-Layer SAEs (06 Sep 2024)
   Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison

8. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (02 Jul 2024)
   Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao

9. Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT (19 Feb 2024)
   Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu

10. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model (30 Apr 2023)
    Michael Hanna, Ollie Liu, Alexandre Variengien

11. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (01 Nov 2022)
    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

12. Toy Models of Superposition (21 Sep 2022)
    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah