arXiv:2310.13548 (v4, latest)
Towards Understanding Sycophancy in Language Models
20 October 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Papers citing "Towards Understanding Sycophancy in Language Models" (50 of 178 papers shown)
Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs) (27 May 2025)
Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, Jatinder Singh

EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding (26 May 2025)
Zhaowei Zhang, Minghua Yi, Mengmeng Wang, Fengshuo Bai, Zilong Zheng, Yipeng Kang, Yaodong Yang

Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models (26 May 2025) [ELM]
Baihui Zheng, Boren Zheng, Kerui Cao, Y. Tan, Zhendong Liu, ..., Jian Yang, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems (23 May 2025)
Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang

AI-Augmented LLMs Achieve Therapist-Level Responses in Motivational Interviewing (23 May 2025) [AI4MH]
Yinghui Huang, Yuxuan Jiang, Hui Liu, Yixin Cai, Weiqing Li, Xiangen Hu

Social Sycophancy: A Broader Understanding of LLM Sycophancy (20 May 2025)
Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky
From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents (19 May 2025) [AAML]
Liangxuan Wu, Chao Wang, Tianming Liu, Yanjie Zhao, Haoyu Wang

WikiPersonas: What Can We Learn From Personalized Alignment to Famous People? (19 May 2025)
Zilu Tang, Afra Feyza Akyürek, Ekin Akyürek, Derry Wijaya

Are vision language models robust to uncertain inputs? (17 May 2025) [AAML, VLM]
Xi Wang, Eric Nalisnick
Presented at ResearchTrend Connect | VLM on 18 Jun 2025

LLM Agents Are Hypersensitive to Nudges (16 May 2025)
Manuel Cherep, Pattie Maes, Nikhil Singh

Must Read: A Systematic Survey of Computational Persuasion (12 May 2025)
Nimet Beyza Bozdag, Shuhaib Mehri, Xiaocheng Yang, Xiaomeng Jin, Zirui Cheng, Esin Durmus, Jiaxuan You, Heng Ji, Gokhan Tur, Dilek Hakkani-Tur

An alignment safety case sketch based on debate (06 May 2025)
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
Real-World Gaps in AI Governance Research (30 Apr 2025)
Ilan Strauss, Isobel Moure, Tim O'Reilly, Sruly Rosenblat

Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society (24 Apr 2025)
Yi Zeng, Yijiao Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, ..., Chao Liu, Yaodong Yang, Boyuan Chen, Jinyu Fan

AGI Is Coming... Right After AI Learns to Play Wordle (21 Apr 2025) [LLMAG]
Sarath Shekkizhar, Romain Cosentino

Establishing Reliability Metrics for Reward Models in Large Language Models (21 Apr 2025)
Yizhou Chen, Yawen Liu, Xuesi Wang, Qingtao Yu, Guangda Huzhang, Anxiang Zeng, Han Yu, Zhiming Zhou

Evaluating the Goal-Directedness of Large Language Models (16 Apr 2025) [ELM, LM&MA, LM&Ro, LRM]
Tom Everitt, Cristina Garbacea, Alexis Bellot, Jonathan G. Richens, Henry Papadatos, Simeon Campos, Rohin Shah

A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future (12 Apr 2025) [LRM]
Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou

Societal Impacts Research Requires Benchmarks for Creative Composition Tasks (09 Apr 2025)
Judy Hanwen Shen, Carlos Guestrin
Unraveling Human-AI Teaming: A Review and Outlook (08 Apr 2025)
Bowen Lou, Tian Lu, T. S. Raghu, Yingjie Zhang

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations (07 Apr 2025) [LRM]
Pedro Ferreira, Wilker Aziz, Ivan Titov

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning (03 Apr 2025)
Julian Minder, Clement Dumas, Caden Juang, Bilal Chugtai, Neel Nanda

Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery (01 Apr 2025)
Nicholas Clark, Hua Shen, Bill Howe, Tanushree Mitra

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions (28 Mar 2025)
Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, R. Padman

Writing as a testbed for open ended agents (25 Mar 2025) [LLMAG]
Sian Gooding, Lucia Lopez-Rivilla, Edward Grefenstette

DarkBench: Benchmarking Dark Patterns in Large Language Models (13 Mar 2025)
Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz

Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity (08 Mar 2025)
HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, Jinyeong Bak, James Evans, Xing Xie
A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval (07 Mar 2025) [LM&Ro, LM&MA]
Yu Zhang, Shutong Qiao, Jiaqi Zhang, Tzu-Heng Lin, Chen Gao, Yongqian Li

Human Preferences for Constructive Interactions in Language Model Alignment (05 Mar 2025)
Yara Kyrychenko, Jon Roozenbeek, Brandon Davidson, S. V. D. Linden, Ramit Debnath

Linear Representations of Political Perspective Emerge in Large Language Models (03 Mar 2025)
Junsol Kim, James Evans, Aaron Schein

What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text (03 Mar 2025) [ELM, ALM]
Arturs Kanepajs, Aditi Basu, Sankalpa Ghose, Constance Li, Akshat Mehta, Ronak Mehta, Samuel David Tucker-Davis, Eric Zhou, Bob Fischer, Jacy Reese Anthis

AI Will Always Love You: Studying Implicit Biases in Romantic AI Companions (27 Feb 2025)
Clare Grogan, Jackie Kay, Maria Perez-Ortiz

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation (26 Feb 2025) [ReLM, ELM, LRM]
Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (24 Feb 2025) [AAML]
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
Grounded Persuasive Language Generation for Automated Marketing (24 Feb 2025)
Jibang Wu, Chenghao Yang, Simon Mahns, Chaoqi Wang, Hao Zhu, Fei Fang, Haifeng Xu

Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs (23 Feb 2025)
Jonathan Rystrøm, Hannah Rose Kirk, Scott A. Hale

Lines of Thought in Large Language Models (17 Feb 2025) [LRM, VLM, LM&Ro]
Raphaël Sarfati, Toni J. B. Liu, Nicolas Boullé, Christopher Earls

Why human-AI relationships need socioaffective alignment (04 Feb 2025)
Hannah Rose Kirk, Iason Gabriel, Chris Summerfield, Bertie Vidgen, Scott A. Hale

Breaking Focus: Contextual Distraction Curse in Large Language Models (03 Feb 2025) [AAML]
Yue Huang, Yanbo Wang, Zixiang Xu, Chujie Gao, Siyuan Wu, Jiayi Ye, Preslav Nakov, Pin-Yu Chen, Wei Wei

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking (22 Jan 2025)
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment (16 Jan 2025)
Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, ..., Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Sinong Wang
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue (08 Jan 2025) [LLMAG]
Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, Shiji Zhao, ..., Hang Su, Jialing Tao, Hui Xue, Jun Zhu

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability (24 Dec 2024) [ALM]
Haoyang Li, Xudong Han, Zenan Zhai, Honglin Mu, Hao Wang, ..., Eduard H. Hovy, Iryna Gurevych, Preslav Nakov, Monojit Choudhury, Timothy Baldwin

If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World (02 Dec 2024)
Adrian de Wynter

Sycophancy in Large Language Models: Causes and Mitigations (22 Nov 2024)
Lars Malmqvist

Evaluating the Prompt Steerability of Large Language Models (19 Nov 2024) [LLMSV]
Erik Miehling, Michael Desmond, Karthikeyan N. Ramamurthy, Elizabeth M. Daly, Pierre Dognin, Jesus Rios, Djallel Bouneffouf, Miao Liu

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems (15 Nov 2024)
Caspar Oesterheld, Emery Cooper, Miles Kodama, Linh Chi Nguyen, Ethan Perez

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset (12 Nov 2024)
Khaoula Chehbouni, Jonathan Colaço-Carr, Yash More, Jackie CK Cheung, G. Farnadi

CoPrompter: User-Centric Evaluation of LLM Instruction Alignment for Improved Prompt Engineering (09 Nov 2024)
Ishika Joshi, Simra Shahid, Shreeya Venneti, Manushree Vasu, Yantao Zheng, Yunyao Li, Balaji Krishnamurthy, Gromit Yeuk-Yin Chan

Hidden Persuaders: LLMs' Political Leaning and Their Influence on Voters (31 Oct 2024)
Yujin Potter, Shiyang Lai, Junsol Kim, James Evans, Basel Alomair