arXiv: 2311.15131
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
25 November 2023
James Campbell, Richard Ren, Phillip Guo
Tags: HILM
Papers citing "Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching" (4 of 4 papers shown):
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, ..., Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu
Tags: LRM
21 May 2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
Tags: HILM, ALM
05 Mar 2025

On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Sihang Li, Yongbin Li
17 Oct 2024

Standards for Belief Representations in LLMs
Daniel A. Herrmann, B. Levinstein
31 May 2024