Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

9 August 2024

Tom Lieberum

Senthooran Rajamanoharan

Rohin Shah

Papers citing "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2"

50 / 67 papers shown

Are Sparse Autoencoders Useful for Java Function Bug Detection? Rui Melo Claudia Mamede Andre Catarino Rui Abreu Henrique Lopes Cardoso 31 0 0 15 May 2025
An Introduction to Discrete Variational Autoencoders Alan Jeffares Liyuan Liu DRL BDL CML 41 0 0 15 May 2025
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders Dong Shu Xuansheng Wu Haiyan Zhao Jundong Li Ninghao Liu LLMSV 42 0 0 12 May 2025
Patterns and Mechanisms of Contrastive Activation Engineering Yixiong Hao Ayush Panda Stepan Shabalin Sheikh Abdur Raheem Ali LLMSV 67 0 0 06 May 2025
Empirical Evaluation of Progressive Coding for Sparse Autoencoders Hans Peter Anders Søgaard 38 0 0 30 Apr 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He Jiadong Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J.N. Zhang Xipeng Qiu 70 0 0 29 Apr 2025
Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption Wenxiao Wang Parsa Hosseini S. Feizi LRM AI4CE 64 0 0 29 Apr 2025
Representation Learning on a Random Lattice Aryeh Brill OOD FAtt AI4CE 73 0 0 28 Apr 2025
Scaling sparse feature circuit finding for in-context learning Dmitrii Kharlapenko Shivalika Singh Fazl Barez Arthur Conmy Neel Nanda 26 0 0 18 Apr 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 51 1 0 17 Apr 2025
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs Aashiq Muhamed Jacopo Bonato Mona Diab Virginia Smith MU 66 0 0 11 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric Yixin Cao Jiahao Ying Yansen Wang Xipeng Qiu Xuanjing Huang Yugang Jiang ELM 44 2 0 10 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems Simon Lermen Mateusz Dziemian Natalia Pérez-Campanero Antolín 38 0 0 10 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality Sewoong Lee Adam Davies Marc E. Canby J. Hockenmaier LLMSV 70 0 0 31 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 34 0 0 21 Mar 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders Bart Bussmann Noa Nabeshima Adam Karvonen Neel Nanda 59 1 0 21 Mar 2025
Towards LLM Guardrails via Sparse Representation Steering Zeqing He Zhibo Wang Huiyu Xu Kui Ren LLMSV 52 1 0 21 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen Can Rager Johnny Lin Curt Tigges Joseph Isaac Bloom ... Kola Ayonrinde Matthew Wearden Arthur Conmy Samuel Marks Neel Nanda MU 64 8 0 12 Mar 2025
Mixture of Experts Made Intrinsically Interpretable Xingyi Yang Constantin Venhoff Ashkan Khakzar Christian Schroeder de Witt P. Dokania Adel Bibi Philip Torr MoE 54 0 0 05 Mar 2025
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders Kristian Kuznetsov Laida Kushnareva Polina Druzhinina Anton Razzhigaev Anastasia Voznyuk Irina Piontkovskaya Evgeny Burnaev Serguei Barannikov 42 0 0 05 Mar 2025
Towards Understanding Distilled Reasoning Models: A Representational Approach David D. Baek Max Tegmark LRM 80 3 0 05 Mar 2025
SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs Samir Abdaljalil Filippo Pallucchini Andrea Seveso Hasan Kurban Fabio Mercorio Erchin Serpedin HILM 77 0 0 04 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Sai Sumedh R. Hindupur Ekdeep Singh Lubana Thomas Fel Demba Ba 47 5 0 03 Mar 2025
Interpreting CLIP with Hierarchical Sparse Autoencoders Vladimir Zaigrajew Hubert Baniecki P. Biecek 56 0 0 27 Feb 2025
Do Sparse Autoencoders Generalize? A Case Study of Answerability Lovis Heindrich Philip Torr Fazl Barez Veronika Thost 82 1 0 27 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features Bruno Puri Aakriti Jain Elena Golimblevskaia Patrick Kahardipraja Thomas Wiegand Wojciech Samek Sebastian Lapuschkin 135 0 0 24 Feb 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders Xuansheng Wu Jiayi Yuan Wenlin Yao Xiaoming Zhai Ninghao Liu LLMSV 80 4 0 24 Feb 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons Yuheng Chen Pengfei Cao Kang Liu Jun Zhao 50 0 0 18 Feb 2025
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards Xinyi Yang Liang Zeng Heng Dong Chao Yu X. Wu H. Yang Yu Wang Milind Tambe Tonghan Wang 76 2 0 18 Feb 2025
Multi-Faceted Multimodal Monosemanticity Hanqi Yan Xiangxiang Cui Lu Yin Paul Pu Liang Yulan He Yifei Wang 44 0 0 16 Feb 2025
SEER: Self-Explainability Enhancement of Large Language Models' Representations Guanxu Chen Dongrui Liu Tao Luo Jing Shao LRM MILM 67 1 0 07 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 49 1 0 09 Jan 2025
Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning? Mengyu Ye Tatsuki Kuribayashi Goro Kobayashi Jun Suzuki LRM 97 0 0 20 Dec 2024
VISTA: A Panoramic View of Neural Representations Tom White 72 0 0 03 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning Keito Kudo Yoichi Aoki Tatsuki Kuribayashi Shusaku Sone Masaya Taniguchi Ana Brassard Keisuke Sakaguchi Kentaro Inui ReLM LRM 77 0 0 02 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando Oscar Obeso Senthooran Rajamanoharan Neel Nanda 85 12 0 21 Nov 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders Charles O'Neill David Klindt 98 1 0 20 Nov 2024
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs Ruben Härle Felix Friedrich Manuel Brack Bjorn Deiseroth P. Schramowski Kristian Kersting 45 0 0 11 Nov 2024
Beyond Toxic Neurons: A Mechanistic Analysis of DPO for Toxicity Reduction Yushi Yang Filip Sondej Harry Mayne Adam Mahdi 29 1 0 10 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 42 5 0 07 Nov 2024
Improving Steering Vectors by Targeting Sparse Autoencoder Features Sviatoslav Chalnev Matthew Siu Arthur Conmy LLMSV 55 16 0 04 Nov 2024
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity Yuqi Luo Chenyang Song Xu Han Yuxiao Chen Chaojun Xiao Zhiyuan Liu Maosong Sun 54 3 0 04 Nov 2024
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models Aashiq Muhamed Mona Diab Virginia Smith 45 2 0 01 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups Davide Ghilardi Federico Belotti Marco Molinari 40 2 0 28 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders Viacheslav Surkov Chris Wendler Mikhail Terekhov Justin Deschenaux Robert West Çağlar Gülçehre VLM 40 13 0 28 Oct 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness Qi Zhang Yifei Wang Jingyi Cui Xiang Pan Qi Lei Stefanie Jegelka Yisen Wang AAML 34 1 0 27 Oct 2024
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders Zhengfu He Wentao Shu Xuyang Ge Lingjie Chen Junxuan Wang ... Qipeng Guo Xuanjing Huang Zuxuan Wu Yu-Gang Jiang Xipeng Qiu 40 14 0 27 Oct 2024
Applying sparse autoencoders to unlearn knowledge in language models Eoin Farrell Yeu-Tong Lau Arthur Conmy MU 35 14 0 25 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 65 10 0 18 Oct 2024
The Persian Rug: solving toy models of superposition using large-scale symmetries Aditya Cowsik Kfir Dolev Alex Infanger 26 0 0 15 Oct 2024