2504.20271
Cited By
Investigating task-specific prompts and sparse autoencoders for activation monitoring
28 April 2025
Henk Tillman
Dan Mossing
LLMSV
Papers citing "Investigating task-specific prompts and sparse autoencoders for activation monitoring"
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
David M. Krueger
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
169
0
0
12 Jun 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
152
17
0
23 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant
Shan Chen
Kuleen Sasse
Hugo J. W. L. Aerts
Thomas Hartvigsen
Danielle S. Bitterman
103
6
0
17 Feb 2025
Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill
Bilal Chughtai
Stefan Heimersheim
Marius Hobbhahn
LLMSV
132
10
0
05 Feb 2025
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad
Michael Toker
Zorik Gekhman
Roi Reichart
Idan Szpektor
Hadas Kotek
Yonatan Belinkov
HILM
AIFin
134
45
0
03 Oct 2024