Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
arXiv:2306.03341 · 6 June 2023
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
Tags: KELM, HILM
Papers citing "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" (50 of 411 shown):
- SH2: Self-Highlighted Hesitation Helps You Decode More Truthfully · Jushi Kai, Hai Hu, Zhouhan Lin · Tags: HILM · 11 Jan 2024
- Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems · Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, ..., Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li · 11 Jan 2024
- The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models · Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen · Tags: HILM · 06 Jan 2024
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity · Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea · 03 Jan 2024
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models · S.M. Towhidul Islam Tonmoy, S. M. M. Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, Amitava Das · Tags: HILM · 02 Jan 2024
- Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models · Matthew Dahl, Varun Magesh, Mirac Suzgun, Daniel E. Ho · Tags: HILM, AILaw · 02 Jan 2024
- Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning · Zhongzhi Chen, Xingwu Sun, Xianfeng Jiao, Fengzong Lian, Zhanhui Kang, Di Wang, Cheng-zhong Xu · Tags: HILM · 29 Dec 2023
- LLM Factoscope: Uncovering LLMs' Factual Discernment through Inner States Analysis · Jinwen He, Yujia Gong, Kai-xiang Chen, Zijin Lin, Cheng'an Wei, Yue Zhao · 27 Dec 2023
- Observable Propagation: Uncovering Feature Vectors in Transformers · Jacob Dunefsky, Arman Cohan · 26 Dec 2023
- Alleviating Hallucinations of Large Language Models through Induced Hallucinations · Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi · Tags: HILM · 25 Dec 2023
- Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention · Zhen Tan, Tianlong Chen, Zhenyu Zhang, Huan Liu · 22 Dec 2023
- Challenges with unsupervised LLM knowledge discovery · Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah · 15 Dec 2023
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision · Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, ..., Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu · Tags: ELM · 14 Dec 2023
- Alignment for Honesty · Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu · 12 Dec 2023
- Steering Llama 2 via Contrastive Activation Addition · Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner · Tags: LLMSV · 09 Dec 2023
- Improving Activation Steering in Language Models with Mean-Centring · Ole Jorgensen, Dylan R. Cope, Nandi Schoots, Murray Shanahan · Tags: LLMSV · 06 Dec 2023
- Eliciting Latent Knowledge from Quirky Language Models · Alex Troy Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose · Tags: HILM, RALM, KELM · 02 Dec 2023
- Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination · Haoqiang Kang, Xiao-Yang Liu · Tags: RALM · 27 Nov 2023
- Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching · James Campbell, Richard Ren, Phillip Guo · Tags: HILM · 25 Nov 2023
- Effective Large Language Model Adaptation for Improved Grounding and Citation Generation · Xi Ye, Ruoxi Sun, Sercan Ö. Arik, Tomas Pfister · Tags: HILM · 16 Nov 2023
- Ever: Mitigating Hallucination in Large Language Models through Real-Time Verification and Rectification · Haoqiang Kang, Juntong Ni, Huaxiu Yao · Tags: HILM, LRM · 15 Nov 2023
- Fine-tuning Language Models for Factuality · Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn · Tags: KELM, HILM, SyDa · 14 Nov 2023
- A Survey of Confidence Estimation and Calibration in Large Language Models · Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych · Tags: UQCV · 14 Nov 2023
- Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains · Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang · 13 Nov 2023
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering · Sheng Liu, Haotian Ye, Lei Xing, James Y. Zou · 11 Nov 2023
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions · Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, ..., Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu · Tags: LRM, HILM · 09 Nov 2023
- SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency · Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley Malin, Kumar Sricharan · Tags: HILM, LRM · 03 Nov 2023
- Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism · Lang Cao · 02 Nov 2023
- Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation · Yixin Wan, Fanyou Wu, Weijie Xu, Srinivasan H. Sengamedu · Tags: HILM · 28 Oct 2023
- Personas as a Way to Model Truthfulness in Language Models · Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He · Tags: HILM · 27 Oct 2023
- LUNA: A Model-Based Universal Analysis Framework for Large Language Models · Da Song, Xuan Xie, Jiayang Song, Derui Zhu, Yuheng Huang, Felix Juefei Xu, Lei Ma · Tags: ALM · 22 Oct 2023
- Understanding and Controlling a Maze-Solving Policy Network · Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, M. MacDiarmid, Alexander Matt Turner · 12 Oct 2023
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity · Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, ..., Linyi Yang, Jindong Wang, Xing Xie, Zheng-Wei Zhang, Yue Zhang · Tags: HILM, KELM · 11 Oct 2023
- An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l · James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak · 11 Oct 2023
- Teaching Language Models to Hallucinate Less with Synthetic Tasks · Erik Jones, Hamid Palangi, Clarisse Simoes, Varun Chandrasekaran, Subhabrata Mukherjee, Arindam Mitra, Ahmed Hassan Awadallah, Ece Kamar · Tags: HILM · 10 Oct 2023
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets · Samuel Marks, Max Tegmark · Tags: HILM · 10 Oct 2023
- Co-audit: tools to help humans double-check AI-generated content · Andrew D. Gordon, Carina Negreanu, J. Cambronero, Rasika Chakravarthy, Ian Drosos, ..., Hannah Richardson, Advait Sarkar, Stephanie Simmons, Jack Williams, Ben Zorn · 02 Oct 2023
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods · Fred Zhang, Neel Nanda · Tags: LLMSV · 27 Sep 2023
- Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models · Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi · Tags: HILM · 26 Sep 2023
- Large Language Model Alignment: A Survey · Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong · Tags: LM&MA · 26 Sep 2023
- Can LLM-Generated Misinformation Be Detected? · Canyu Chen, Kai Shu · Tags: DeLMO · 25 Sep 2023
- Chain-of-Verification Reduces Hallucination in Large Language Models · S. Dhuliawala, M. Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston · Tags: LRM, HILM · 20 Sep 2023
- RAIN: Your Language Models Can Align Themselves without Finetuning · Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang R. Zhang · Tags: SILM · 13 Sep 2023
- Unsupervised Contrast-Consistent Ranking with Language Models · Niklas Stoehr, Pengxiang Cheng, Jing Wang, Daniel Preotiuc-Pietro, Rajarshi Bhowmik · Tags: ALM · 13 Sep 2023
- Cognitive Mirage: A Review of Hallucinations in Large Language Models · Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, Weiqiang Jia · Tags: HILM · 13 Sep 2023
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models · Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, Pengcheng He · Tags: HILM · 07 Sep 2023
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models · Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, ..., Longyue Wang, A. Luu, Wei Bi, Freda Shi, Shuming Shi · Tags: RALM, LRM, HILM · 03 Sep 2023
- Emergent Linear Representations in World Models of Self-Supervised Sequence Models · Neel Nanda, Andrew Lee, Martin Wattenberg · Tags: FAtt, MILM · 02 Sep 2023
- AI Deception: A Survey of Examples, Risks, and Potential Solutions · Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks · 28 Aug 2023
- Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation · Xinshuo Hu, Dongfang Li, Baotian Hu, Zihao Zheng, Zhenyu Liu, Hao Fei · Tags: KELM, MU · 16 Aug 2023