Towards Understanding Sycophancy in Language Models

20 October 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
arXiv:2310.13548

Papers citing "Towards Understanding Sycophancy in Language Models"

Showing 50 of 178 citing papers (page 1 of 4).

Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
Tom A. Lamb, Adam Davies, Alasdair Paren, Philip Torr, Francesco Pinto (30 Oct 2024)

Distinguishing Ignorance from Error in LLM Hallucinations
Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov (29 Oct 2024) [HILM]

An Auditing Test To Detect Behavioral Shift in Language Models
Leo Richter, Xuanli He, Pasquale Minervini, Matt J. Kusner (25 Oct 2024)

Teaching Models to Balance Resisting and Accepting Persuasion
Elias Stengel-Eskin, Peter Hase, Joey Tianyi Zhou (18 Oct 2024) [MU]

Accounting for Sycophancy in Language Model Uncertainty Estimation
Anthony Sicilia, Mert Inan, Malihe Alikhani (17 Oct 2024)

Anchored Alignment for Self-Explanations Enhancement
Luis Felipe Villa-Arenas, Ata Nizamoglu, Qianli Wang, Sebastian Möller, Vera Schmitt (17 Oct 2024)

From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem, Pouya Pezeshkpour, Hayate Iso, Seiji Maekawa, Nikita Bhutani, Estevam R. Hruschka (17 Oct 2024) [HILM]

STRUX: An LLM for Decision-Making with Structured Explanations
Yiming Lu, Yebowen Hu, H. Foroosh, Wei Jin, Fei Liu (16 Oct 2024)

Conformity in Large Language Models
Xiaochen Zhu, Caiqi Zhang, Tom Stafford, Nigel Collier, Andreas Vlachos (16 Oct 2024)

Are UFOs Driving Innovation? The Illusion of Causality in Large Language Models
María Victoria Carro, Francisca Gauna Selasco, Denise Alejandra Mester, Mario Leiva (15 Oct 2024) [LRM]

Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs
Shuo Li, Tao Ji, Xiaoran Fan, Linsheng Lu, L. Yang, ..., Yansen Wang, Xiaohui Zhao, Tao Gui, Qi Zhang, Xuanjing Huang (15 Oct 2024)

Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi (15 Oct 2024) [LLMSV]

Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang (14 Oct 2024)

The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark (10 Oct 2024)

DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life
Yu Ying Chiu, Liwei Jiang, Yejin Choi (03 Oct 2024)

Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi (02 Oct 2024)

Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation
Shan Chen, Mingye Gao, Kuleen Sasse, Thomas Hartvigsen, Brian Anthony, Lizhou Fan, Hugo J. W. L. Aerts, Jack Gallifant, Danielle S. Bitterman (30 Sep 2024) [LM&MA]

A Survey on the Honesty of Large Language Models
Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, ..., Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam (27 Sep 2024) [HILM]

AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking
Michelle S. Lam, Fred Hohman, Dominik Moritz, Jeffrey P. Bigham, Kenneth Holstein, Mary Beth Kery (26 Sep 2024)

Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng (19 Sep 2024)

PersonaFlow: Boosting Research Ideation with LLM-Simulated Expert Personas
Yiren Liu, Pranav Sharma, Mehul Jitendra Oswal, Haijun Xia, Yun Huang (19 Sep 2024)

From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, ..., Deng Cai, Yonggang Zhang, Wenxiao Wang, Xu Shen, Jieping Ye (03 Sep 2024)

Conversational Complexity for Assessing Risk in Large Language Models
John Burden, Manuel Cebrian, José Hernández-Orallo (02 Sep 2024)

How will advanced AI systems impact democracy?
Christopher Summerfield, Lisa Argyle, Michiel Bakker, Teddy Collins, Esin Durmus, ..., Elizabeth Seger, Divya Siddarth, Henrik Skaug Sætra, MH Tessler, M. Botvinick (27 Aug 2024)

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, ..., Chole Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay (22 Aug 2024) [AI4CE]

Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework
Yunpu Zhao, Rui Zhang, Junbin Xiao, Changxin Ke, Ruibo Hou, Yifan Hao, Qi Guo (21 Aug 2024)

How Susceptible are LLMs to Influence in Prompts?
Sotiris Anagnostidis, Jannis Bulian (17 Aug 2024) [LRM]

Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews
Samantha W. T. Chan, Pat Pataranutaporn, Aditya Suri, W. Zulfikar, Pattie Maes, Elizabeth F. Loftus (08 Aug 2024) [HILM]

On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li (06 Aug 2024)

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
Jin Gao, Lei Gan, Yuankai Li, Yixin Ye, Dequan Wang (02 Aug 2024)

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Tianhao Wu, Weizhe Yuan, O. Yu. Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar (28 Jul 2024) [ALM, KELM, LRM]

Blockchain for Large Language Model Security and Safety: A Holistic Survey
Caleb Geren, Amanda Board, Gaby G. Dagher, Tim Andersen, Jun Zhuang (26 Jul 2024)

GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy
Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci (25 Jul 2024)

Distilling System 2 into System 1
Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov (08 Jul 2024) [OffRL, LRM]

Large Language Model Agents for Improving Engagement with Behavior Change Interventions: Application to Digital Mindfulness
Harsh Kumar, Suhyeon Yoo, Angela M. Zavaleta Bernuy, Jiakai Shi, Huayin Luo, Joseph Jay Williams, Anastasia Kuzminykh, Ashton Anderson, Rachel Kornfield (03 Jul 2024)

Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng, Stuart Russell, Jacob Steinhardt (27 Jun 2024) [HILM]

Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Joey Tianyi Zhou (27 Jun 2024) [KELM, LRM]

AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
Adam Dahlgren Lindstrom, Leila Methnani, Lea Krause, Petter Ericson, Ínigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe (26 Jun 2024) [ALM]

LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
Elinor Poole-Dayan, Deb Roy, Jad Kabbara (25 Jun 2024)

WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem (24 Jun 2024)

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram (21 Jun 2024) [ELM]

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Yalan Qin, Chongye Guo, Borong Zhang, Boyuan Chen, Josef Dai, ..., Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, Yaodong Yang (20 Jun 2024)

BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu (19 Jun 2024) [HILM, ALM]

DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation
A. B. M. A. Rahman, Saeed Anwar, Muhammad Usman, Ajmal Mian (13 Jun 2024) [HILM]

Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation
Neeraj Varshney, Satyam Raj, Venkatesh Mishra, Agneet Chatterjee, Ritika Sarkar, Amir Saeidi, Chitta Baral (08 Jun 2024) [LRM]

ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models
Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, Guojie Song (06 Jun 2024) [LLMAG, ELM]

Dishonesty in Helpful and Harmless Alignment
Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn (04 Jun 2024) [LLMSV]

AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, Yang Xiang (04 Jun 2024)

Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins (02 Jun 2024) [SyDa]

Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho (30 May 2024) [HILM, ELM, AILaw]