2502.03407
Detecting Strategic Deception Using Linear Probes
5 February 2025
Nicholas Goldowsky-Dill
Bilal Chughtai
Stefan Heimersheim
Marius Hobbhahn
LLMSV
Papers citing "Detecting Strategic Deception Using Linear Probes"
7 / 7 papers shown
Title
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
David M. Krueger
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
169
0
0
12 Jun 2025
Fine-Grained Interpretation of Political Opinions in Large Language Models
Jingyu Hu
Mengyue Yang
Mengnan Du
Weiru Liu
167
0
0
05 Jun 2025
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Chris Cundy
Adam Gleave
60
0
0
20 May 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring
Henk Tillman
Dan Mossing
LLMSV
100
1
0
28 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
113
0
0
10 Apr 2025
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Satvik Golechha
Adrià Garriga-Alonso
LLMAG
89
2
0
05 Apr 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
181
0
0
24 Feb 2025