ResearchTrend.AI
Detecting Strategic Deception Using Linear Probes

5 February 2025
Nicholas Goldowsky-Dill
Bilal Chughtai
Stefan Heimersheim
Marius Hobbhahn
    LLMSV

Papers citing "Detecting Strategic Deception Using Linear Probes"

7 / 7 papers shown
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
David M. Krueger
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
12 Jun 2025
Fine-Grained Interpretation of Political Opinions in Large Language Models
Jingyu Hu
Mengyue Yang
Mengnan Du
Weiru Liu
05 Jun 2025
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Chris Cundy
Adam Gleave
20 May 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring
Henk Tillman
Dan Mossing
LLMSV
28 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
10 Apr 2025
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Satvik Golechha
Adrià Garriga-Alonso
LLMAG
05 Apr 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
24 Feb 2025