Investigating task-specific prompts and sparse autoencoders for activation monitoring

28 April 2025

Papers citing "Investigating task-specific prompts and sparse autoencoders for activation monitoring"


Title | Authors | Tags | Counts | Date
The Impact of Off-Policy Training Data on Probe Generalisation | Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov | | 68 0 0 | 21 Nov 2025
Red-teaming Activation Probes using Prompted LLMs | Phil Blandfort, Robert Graham | AAML, LLMSV | 283 0 0 | 01 Nov 2025
Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection | Xiaodan Li, Mengjie Wu, Yao Zhu, Yunna Lv, YueFeng Chen, Cen Chen, Jianmei Guo, H. Xue | KELM | 131 0 0 | 09 Oct 2025
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models | James Oldfield, Juil Sock, Ioannis Patras, Adel Bibi, Fazl Barez | | 104 0 0 | 30 Sep 2025
Real-Time Detection of Hallucinated Entities in Long-Form Generation | Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda | HILM | 133 5 0 | 26 Aug 2025
Persona Features Control Emergent Misalignment | Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, ..., Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing | | 172 14 0 | 24 Jun 2025
Detecting High-Stakes Interactions with Activation Probes | Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David M. Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov | | 499 10 0 | 12 Jun 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing | Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda | | 312 40 0 | 23 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability | Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo J. W. L. Aerts, Thomas Hartvigsen, Danielle S. Bitterman | | 214 13 0 | 17 Feb 2025
Detecting Strategic Deception Using Linear Probes | Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn | LLMSV | 392 33 0 | 05 Feb 2025
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. International Conference on Learning Representations (ICLR), 2024 | Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov | HILM, AIFin | 567 104 0 | 03 Oct 2024