SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

17 February 2025

Papers citing "SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models"


Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu
LLMSV, 12 May 2025