arXiv:2504.20271
Investigating task-specific prompts and sparse autoencoders for activation monitoring
28 April 2025
Henk Tillman
Dan Mossing
LLMSV
Papers citing "Investigating task-specific prompts and sparse autoencoders for activation monitoring"
11 / 11 papers shown
The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch
Samuel Dower
Adrians Skapars
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
64
0
0
21 Nov 2025
Red-teaming Activation Probes using Prompted LLMs
Phil Blandfort
Robert Graham
AAML
LLMSV
271
0
0
01 Nov 2025
Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
Xiaodan Li
Mengjie Wu
Yao Zhu
Yunna Lv
YueFeng Chen
Cen Chen
Jianmei Guo
H. Xue
KELM
131
0
0
09 Oct 2025
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield
Juil Sock
Ioannis Patras
Adel Bibi
Fazl Barez
96
0
0
30 Sep 2025
Real-Time Detection of Hallucinated Entities in Long-Form Generation
Oscar Obeso
Andy Arditi
Javier Ferrando
Joshua Freeman
Cameron Holmes
Neel Nanda
HILM
133
5
0
26 Aug 2025
Persona Features Control Emergent Misalignment
Miles Wang
Tom Dupré la Tour
Olivia Watkins
Alex Makelov
Ryan A. Chi
...
Jeffrey Wang
Achyuta Rajaram
Johannes Heidecke
Tejal Patwardhan
Dan Mossing
156
14
0
24 Jun 2025
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
David M. Krueger
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
499
9
0
12 Jun 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
312
39
0
23 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant
Shan Chen
Kuleen Sasse
Hugo J. W. L. Aerts
Thomas Hartvigsen
Danielle S. Bitterman
214
13
0
17 Feb 2025
Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill
Bilal Chughtai
Stefan Heimersheim
Marius Hobbhahn
LLMSV
392
32
0
05 Feb 2025
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
International Conference on Learning Representations (ICLR), 2024
Hadas Orgad
Michael Toker
Zorik Gekhman
Roi Reichart
Idan Szpektor
Hadas Kotek
Yonatan Belinkov
HILM
AIFin
555
104
0
03 Oct 2024