Detecting Strategic Deception Using Linear Probes

5 February 2025

Papers citing "Detecting Strategic Deception Using Linear Probes"

Detecting High-Stakes Interactions with Activation Probes. Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David M. Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov. 12 Jun 2025.
Fine-Grained Interpretation of Political Opinions in Large Language Models. Jingyu Hu, Mengyue Yang, Mengnan Du, Weiru Liu. 05 Jun 2025.
Preference Learning with Lie Detectors can Induce Honesty or Evasion. Chris Cundy, Adam Gleave. 20 May 2025.
Investigating task-specific prompts and sparse autoencoders for activation monitoring. Henk Tillman, Dan Mossing. 28 Apr 2025.
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems. Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín. 10 Apr 2025.
Among Us: A Sandbox for Measuring and Detecting Agentic Deception. Satvik Golechha, Adrià Garriga-Alonso. 05 Apr 2025.
Representation Engineering for Large-Language Models: Survey and Research Challenges. Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple. 24 Feb 2025.