Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks
B. Levinstein, Daniel A. Herrmann
30 June 2023 · arXiv:2307.00175 (abs / PDF / HTML)

Papers citing "Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks"

46 / 46 papers shown
Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence
Matthieu Queloz
29 Jul 2025

Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann, Matthieu Queloz
07 Jul 2025

Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David M. Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
12 Jun 2025

The Geometries of Truth Are Orthogonal Across Tasks
Waiss Azizian, Michael Kirchhof, Eugène Ndiaye, Louis Béthune, Michal Klein, Pierre Ablin, Marco Cuturi
10 Jun 2025

Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
Communities: LM&MA, AI4CE
05 Jun 2025

Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
Communities: HILM
01 Jun 2025

HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs
Qing Li, Jiahui Geng, Zongxiong Chen, Derui Zhu, Yuxia Wang, Congbo Ma, Chenyang Lyu, Fakhri Karray
30 May 2025

When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
Yuqing Yang, Robin Jia
Communities: KELM, LRM
22 May 2025

Exploring the generalization of LLM truth directions on conversational formats
Timour Ichmoukhamedov, David Martens
14 May 2025

Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple
24 Feb 2025

Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger
Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
Communities: LLMAG, KELM
18 Feb 2025

Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models
Jaturong Kongmanee
25 Jan 2025

Representation in large language models
Cameron C. Yetman
03 Jan 2025

HalluCana: Fixing LLM Hallucination with A Canary Lookahead
Tianyi Li, Erenay Dayanik, Shubhi Tyagi, Andrea Pierleoni
Communities: HILM
10 Dec 2024

A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios
Xiachong Feng, Longxu Dou, Ella Li, Qinghao Wang, Haoran Wang, Yu Guo, Chang Ma, Lingpeng Kong
Communities: AI4CE, LM&Ro, LM&MA, ELM, LLMAG
05 Dec 2024

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, Felix Hofstätter, Teun van der Weij
02 Dec 2024

Linear Probe Penalties Reduce LLM Sycophancy
Henry Papadatos, Rachel Freedman
Communities: LLMSV
01 Dec 2024

Prompt-Guided Internal States for Hallucination Detection of Large Language Models
Fujie Zhang, Peiqi Yu, Biao Yi, Baolei Zhang, Tong Li, Zheli Liu
Communities: HILM, LRM
07 Nov 2024

Distinguishing Ignorance from Error in LLM Hallucinations
Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov
Communities: HILM
29 Oct 2024

Chatting with Bots: AI, Speech Acts, and the Edge of Assertion
Iwan Williams, Tim Bayne
22 Oct 2024

Evaluating Language Model Character Traits
Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Colverd, Louis Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan
05 Oct 2024

Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language
Anthony Costarelli, Mat Allen, Severin Field
03 Oct 2024

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov
Communities: HILM, AIFin
03 Oct 2024

A Survey on the Honesty of Large Language Models
Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, ..., Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam
Communities: HILM
27 Sep 2024

On the Relationship between Truth and Political Bias in Language Models
S. Fulay, William Brannon, Shrestha Mohanty, Cassandra Overney, Elinor Poole-Dayan, Deb Roy, Jad Kabbara
Communities: HILM
09 Sep 2024

Identifying the Source of Generation for Large Language Models
Bumjin Park, Jaesik Choi
05 Jul 2024

Truth is Universal: Robust Detection of Lies in LLMs
Lennart Bürger, Fred Hamprecht, B. Nadler
Communities: HILM
03 Jul 2024

Does ChatGPT Have a Mind?
Simon Goldstein, B. Levinstein
Communities: AI4MH, LRM
27 Jun 2024

Standards for Belief Representations in LLMs
Daniel A. Herrmann, B. Levinstein
31 May 2024

CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control
Huanshuo Liu, Hao Zhang, Zhijiang Guo, Kuicai Dong, Xiangyang Li, Yi Quan Lee, Cong Zhang, Yong Liu
Communities: 3DV
29 May 2024

An Assessment of Model-On-Model Deception
Julius Heitkoetter, Michael Gerovitch, Laker Newhouse
10 May 2024

Truth-value judgment in language models: 'truth directions' are context sensitive
Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen
Communities: KELM
29 Apr 2024

Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov
Communities: HILM
15 Apr 2024

Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Erik Miehling, Manish Nagireddy, P. Sattigeri, Elizabeth M. Daly, David Piorkowski, John T. Richards
Communities: ALM
22 Mar 2024

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe
12 Jan 2024

Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs
Harvey Lederman, Kyle Mahowald
10 Jan 2024

Challenges with unsupervised LLM knowledge discovery
Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah
15 Dec 2023

Weakly Supervised Detection of Hallucinations in LLM Activations
Miriam Rateike, C. Cintas, John Wamburu, Tanya Akumu, Skyler Speakman
05 Dec 2023

Honesty Is the Best Policy: Defining and Mitigating AI Deception
Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt
03 Dec 2023

Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
James Campbell, Richard Ren, Phillip Guo
Communities: HILM
25 Nov 2023

A Survey of Confidence Estimation and Calibration in Large Language Models
Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych
Communities: UQCV
14 Nov 2023

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, ..., Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu
Communities: LRM, HILM
09 Nov 2023

Self-Consistency of Large Language Models under Ambiguity
Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau
Communities: HILM
20 Oct 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks, Max Tegmark
Communities: HILM
10 Oct 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
28 Aug 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
Communities: AAML
15 Jun 2023