arXiv: 2311.15131
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
25 November 2023
James Campbell, Richard Ren, Phillip Guo
Tags: HILM
Papers citing "Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching" (4 of 4 papers shown):
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, ..., Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu
Tags: LRM
21 May 2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
Tags: HILM, ALM
05 Mar 2025

On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Sihang Li, Yongbin Li
17 Oct 2024

Standards for Belief Representations in LLMs
Daniel A. Herrmann, B. Levinstein
31 May 2024