Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg · KELM, HILM · 6 June 2023

Papers citing "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model"

Showing 50 of 411 papers.
A Theoretical Survey on Foundation Models
Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao · 15 Oct 2024

ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability
Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Han Li · 15 Oct 2024

LargePiG: Your Large Language Model is Secretly a Pointer Generator
Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu · HILM, RALM · 15 Oct 2024

Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi · LLMSV · 15 Oct 2024

Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon, Manish Shrivastava, David M. Krueger, Ekdeep Singh Lubana · 15 Oct 2024

Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang · 14 Oct 2024

Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong · 14 Oct 2024

Safety-Aware Fine-Tuning of Large Language Models
Hyeong Kyu Choi, Xuefeng Du, Yixuan Li · 13 Oct 2024

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss Kadarundalagi Raghuram Doss, Lluís Marquez, Miguel Ballesteros, Yassine Benajiba · 11 Oct 2024

NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models
Zheng Yi Ho, Siyuan Liang, Sen Zhang, Yibing Zhan, Dacheng Tao · 11 Oct 2024

Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing
Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song, Ying Wei · 09 Oct 2024

Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning
Runchuan Zhu, Zhipeng Ma, Jiang Wu, Junyuan Gao, Jiaqi Wang, Dahua Lin, Conghui He · 09 Oct 2024

On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task
Javier Ferrando, Marta R. Costa-jussà · 09 Oct 2024

Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Joris Postmus, Steven Abreu · LLMSV · 09 Oct 2024

Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing
Zhuoran Zhang, Yong Li, Zijian Kan, Keyuan Cheng, Lijie Hu, Di Wang · KELM · 08 Oct 2024

Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris · 07 Oct 2024

Mechanistic?
Naomi Saphra, Sarah Wiegreffe · AI4CE · 07 Oct 2024

Activation Scaling for Steering and Interpreting Language Models
Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein · LLMSV, LRM · 07 Oct 2024

CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text
Jun Hirako, Ryohei Sasano, Koichi Takeda · 06 Oct 2024

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank · 04 Oct 2024

Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii, Laida Kushnareva, Kristian Kuznetsov, Anastasia Voznyuk, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov · 03 Oct 2024

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov · HILM, AIFin · 03 Oct 2024

FactAlign: Long-form Factuality Alignment of Large Language Models
Chao-Wei Huang, Yun-Nung Chen · HILM · 02 Oct 2024

Integrative Decoding: Improve Factuality via Implicit Self-consistency
Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, ..., Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong · HILM · 02 Oct 2024

Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien · LLMSV · 02 Oct 2024

Attention layers provably solve single-location regression
Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer · 02 Oct 2024

Do Music Generation Models Encode Music Theory?
Megan Wei, Michael Freeman, Chris Donahue, Chen Sun · MGen · 01 Oct 2024

Style-Specific Neurons for Steering LLMs in Text Style Transfer
Wen Lai, Viktor Hangya, Alexander Fraser · 01 Oct 2024

Robust LLM safeguarding via refusal feature adversarial training
L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda · AAML · 30 Sep 2024

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du · 30 Sep 2024

A Survey on the Honesty of Large Language Models
Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, ..., Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam · HILM · 27 Sep 2024

Model-based Preference Optimization in Abstractive Summarization without Human Feedback
Jaepill Choi, Kyubyung Chae, Jiwoo Song, Yohan Jo, Taesup Kim · 27 Sep 2024

AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking
Michelle S. Lam, Fred Hohman, Dominik Moritz, Jeffrey P. Bigham, Kenneth Holstein, Mary Beth Kery · 26 Sep 2024

HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection
Xuefeng Du, Chaowei Xiao, Yixuan Li · HILM · 26 Sep 2024

Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
Van-Cuong Pham, Thien Huu Nguyen · LLMSV · 16 Sep 2024

Confidence Estimation for LLM-Based Dialogue State Tracking
Yi-Jyun Sun, Suvodip Dey, Dilek Z. Hakkani-Tür, Gokhan Tur · 15 Sep 2024

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding
Xiaoyu Liang, Jiayuan Yu, Lianrui Mu, Jiedong Zhuang, Jiaqi Hu, Yuchen Yang, Jiangnan Ye, Lu Lu, Jian Chen, Haoji Hu · VLM · 10 Sep 2024

On the Relationship between Truth and Political Bias in Language Models
S. Fulay, William Brannon, Shrestha Mohanty, Cassandra Overney, Elinor Poole-Dayan, Deb Roy, Jad Kabbara · HILM · 09 Sep 2024

Attention Heads of Large Language Models: A Survey
Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Mingchuan Yang, Bo Tang, Zhiyu Li · LRM · 05 Sep 2024

From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, ..., Deng Cai, Yonggang Zhang, Wenxiao Wang, Xu Shen, Jieping Ye · 03 Sep 2024

Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models
Duy Khoa Pham, Bao Quoc Vo · LM&MA, HILM · 25 Aug 2024

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, ..., Chole Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay · AI4CE · 22 Aug 2024

Personality Alignment of Large Language Models
Minjun Zhu, Linyi Yang, Yue Zhang · ALM · 21 Aug 2024

LeCov: Multi-level Testing Criteria for Large Language Models
Xuan Xie, Jiayang Song, Yuheng Huang, Da Song, Fuyuan Zhang, Felix Juefei-Xu, Lei Ma · ELM · 20 Aug 2024

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
Jiri Hron, Laura J. Culp, Gamaleldin F. Elsayed, Rosanne Liu, Ben Adlam, ..., T. Warkentin, Lechao Xiao, Kelvin Xu, Jasper Snoek, Simon Kornblith · 14 Aug 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda · 09 Aug 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, ..., Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks · ELM · 31 Jul 2024

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shiping Liu, Kecheng Zheng, Wei Chen · MLLM · 31 Jul 2024

AgentPeerTalk: Empowering Students through Agentic-AI-Driven Discernment of Bullying and Joking in Peer Interactions in Schools
Aditya Paul, Chi Lok Yu, Eva Adelina Susanto, Nicholas Wai Long Lau, Gwenyth Isobel Meadows · LLMAG · 27 Jul 2024

Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Yeung, Kaarel Hänni · 26 Jul 2024