Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

12 May 2025

Papers citing "Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders"

Title | Authors | Tags | Counts | Date
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders | Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu | LLMSV | 147 10 0 | 24 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models | Z. He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Jundong Li | LLMSV | 110 9 0 | 17 Feb 2025
Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts | Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar | LLMSV | 81 2 0 | 14 Feb 2025
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders | Kola Ayonrinde | - | 68 5 0 | 04 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups | Davide Ghilardi, Federico Belotti, Marco Molinari | - | 64 5 0 | 28 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering | Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini | KELM, LLMSV | 101 25 0 | 21 Oct 2024
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | Dan Braun, Jordan K. Taylor, Nicholas Goldowsky-Dill, Lee D. Sharkey | - | 70 39 0 | 17 May 2024
SQuAD: 100,000+ Questions for Machine Comprehension of Text | Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang | RALM | 286 8,134 0 | 16 Jun 2016
Linear Algebraic Structure of Word Senses, with Applications to Polysemy | Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski | - | 83 283 0 | 14 Jan 2016