Cognitive Dissonance: Why Do Language Model Outputs Disagree with
Internal Representations of Truthfulness?

27 November 2023

Dylan Hadfield-Menell

Papers citing "Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?"

Title | Authors | Tags | Counts | Date
What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction | Eitan Wagner, Omri Abend | - | 36 / 0 / 0 | 04 May 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring | Henk Tillman, Dan Mossing | LLMSV | 50 / 0 / 0 | 28 Apr 2025
Continuum-Interaction-Driven Intelligence: Human-Aligned Neural Architecture via Crystallized Reasoning and Fluid Generation | Pengcheng Zhou, Zhiqiang Nie, Haochen Li | - | 48 / 0 / 0 | 12 Apr 2025
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation | Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, Kyomin Jung | ELM | 72 / 5 / 0 | 28 Oct 2024
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation | Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, Rui-cang Wang | LRM | 50 / 4 / 0 | 17 Oct 2024
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis | Daoyang Li, Mingyu Jin, Qingcheng Zeng, Mengnan Du | - | 60 / 2 / 0 | 22 Sep 2024
Can Language Model Understand Word Semantics as A Chatbot? An Empirical Study of Language Model Internal External Mismatch | Jinman Zhao, Xueyan Zhang, Xingyu Yue, Weizhe Chen, Zifan Qian, Ruiyu Wang | LRM | 34 / 0 / 0 | 21 Sep 2024
LLM Internal States Reveal Hallucination Risk Faced With a Query | Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, Pascale Fung | HILM, LRM | 36 / 19 / 0 | 03 Jul 2024
Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell | Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi | - | 49 / 11 / 0 | 20 Jun 2024
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward | Xuan Xie, Jiayang Song, Zhehua Zhou, Yuheng Huang, Da Song, Lei Ma | OffRL | 48 / 6 / 0 | 12 Apr 2024
Uncovering Latent Human Wellbeing in Language Model Embeddings | Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons | - | 36 / 1 / 0 | 19 Feb 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 102 / 169 / 0 | 10 Oct 2023
Truthful AI: Developing and governing AI that does not lie | Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders | HILM | 233 / 109 / 0 | 13 Oct 2021
Probing Classifiers: Promises, Shortcomings, and Advances | Yonatan Belinkov | - | 226 / 405 / 0 | 24 Feb 2021
Language Models as Knowledge Bases? | Fabio Petroni, Tim Rocktaschel, Patrick Lewis, A. Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel | KELM, AI4MH | 415 / 2,586 / 0 | 03 Sep 2019