
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

25 November 2023
James Campbell, Richard Ren, Phillip Guo
HILM

Papers citing "Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching"

4 / 4 papers shown
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, ..., Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu
LRM · 21 May 2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
HILM · ALM · 05 Mar 2025

On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Sihang Li, Yongbin Li
17 Oct 2024

Standards for Belief Representations in LLMs
Daniel A. Herrmann, B. Levinstein
31 May 2024