Towards Understanding Sycophancy in Language Models
arXiv:2310.13548


20 October 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Papers citing "Towards Understanding Sycophancy in Language Models"

Showing 50 of 178 citing papers.
Participation in the age of foundation models
Harini Suresh, Emily Tseng, Meg Young, Mary L. Gray, Emma Pierson, Karen Levy (29 May 2024)

Efficient Model-agnostic Alignment via Bayesian Persuasion
Fengshuo Bai, Mingzhi Wang, Zhaowei Zhang, Boyuan Chen, Yinda Xu, Ying Wen, Yaodong Yang (29 May 2024)

Are Large Language Models Moral Hypocrites? A Study Based on Moral Foundations
José Luiz Nunes, G. F. C. F. Almeida, Marcelo de Araújo, Simone D. J. Barbosa (17 May 2024)

COBias and Debias: Balancing Class Accuracies for Language Models in Inference Time via Nonlinear Integer Programming
Ruixi Lin, Yang You (13 May 2024)

LLM-Generated Black-box Explanations Can Be Adversarially Helpful
R. Ajwani, Shashidhar Reddy Javaji, Frank Rudzicz, Zining Zhu (10 May 2024)

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi, Evan Hubinger (25 Apr 2024)

Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities
Xiaomin Yu, Yezhaohui Wang, Yanfang Chen, Zhen Tao, Dinghao Xi, Shichao Song, Pengnian Qi, Zhiyu Li (25 Apr 2024)
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves (22 Apr 2024)

Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov (15 Apr 2024)

Best Practices and Lessons Learned on Synthetic Data for Language Models
Ruibo Liu, Jerry W. Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, ..., Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai (11 Apr 2024)

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy (08 Apr 2024)

Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li (27 Mar 2024)

Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Erik Miehling, Manish Nagireddy, P. Sattigeri, Elizabeth M. Daly, David Piorkowski, John T. Richards (22 Mar 2024)

Sabiá-2: A New Generation of Portuguese Large Language Models
Thales Sales Almeida, Hugo Queiroz Abonizio, Rodrigo Nogueira, Ramon Pires (14 Mar 2024)
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan (14 Mar 2024)

Knowledge Conflicts for LLMs: A Survey
Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, Wei Xu (13 Mar 2024)

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin (08 Mar 2024)

A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, ..., Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson (07 Mar 2024)

A Language Model's Guide Through Latent Space
Dimitri von Rutte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann (22 Feb 2024)

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan (20 Feb 2024)
Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation
Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, James Evans (19 Feb 2024)

Dissecting Human and LLM Preferences
Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu (17 Feb 2024)

Antagonistic AI
Alice Cai, Ian Arawjo, Elena L. Glassman (12 Feb 2024)

Imagining a Future of Designing with AI: Dynamic Grounding, Constructive Negotiation, and Sustainable Motivation
Priyan Vaithilingam, Ian Arawjo, Elena L. Glassman (12 Feb 2024)

Social Evolution of Published Text and The Emergence of Artificial Intelligence Through Large Language Models and The Problem of Toxicity and Bias
Arifa Khan, P. Saravanan, S. K. Venkatesan (11 Feb 2024)

Factuality of Large Language Models in the Year 2024
Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Georgiev, Rocktim Jyoti Das, Preslav Nakov (04 Feb 2024)
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell (25 Jan 2024)

WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret (22 Jan 2024)

Secrets of RLHF in Large Language Models Part II: Reward Modeling
Bing Wang, Rui Zheng, Luyao Chen, Yan Liu, Shihan Dou, ..., Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yuanyuan Jiang (11 Jan 2024)

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, ..., Junwu Xiong, Xinyu Kong, ZuJie Wen, Ke Xu, Qi Li (11 Jan 2024)

Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models
Matthew Dahl, Varun Magesh, Mirac Suzgun, Daniel E. Ho (02 Jan 2024)
A Computational Framework for Behavioral Assessment of LLM Therapists
Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, Tim Althoff (01 Jan 2024)

Exploiting Novel GPT-4 APIs
Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave (21 Dec 2023)

Alignment for Honesty
Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu (12 Dec 2023)

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman (20 Nov 2023)

System 2 Attention (is something you might need too)
Jason Weston, Sainbayar Sukhbaatar (20 Nov 2023)

Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering
Neha Srikanth, Rupak Sarkar, Heran Mane, Elizabeth M. Aparicio, Quynh C. Nguyen, Rachel Rudinger, Jordan Lee Boyd-Graber (16 Nov 2023)

Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment
Philippe Laban, Lidiya Murakhovs'ka, Caiming Xiong, Chien-Sheng Wu (14 Nov 2023)
Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains
Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang (13 Nov 2023)

Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"
C. D. Freeman, Laura J. Culp, Aaron T Parisi, Maxwell Bileschi, Gamaleldin F. Elsayed, ..., Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha Narain Sohl-Dickstein (08 Nov 2023)

Contextual Confidence and Generative AI
Shrey Jain, Zoe Hitzig, Pamela Mishkin (02 Nov 2023)

Managing extreme AI risks amid rapid progress
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, ..., Philip Torr, Stuart J. Russell, Daniel Kahneman, J. Brauner, Sören Mindermann (26 Oct 2023)

Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu (10 Oct 2023)

Ask Again, Then Fail: Large Language Models' Vacillations in Judgment
Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia (03 Oct 2023)
Simple synthetic data reduces sycophancy in large language models
Jerry W. Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le (07 Aug 2023)

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom (18 Jul 2023)

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson E. Denison, ..., Zac Hatfield-Dodds, Jared Kaplan, J. Brauner, Sam Bowman, Ethan Perez (17 Jul 2023)

Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Mining Insights at Scale
Jonas Oppenlaender, Joonas Hamalainen (08 Jun 2023)

The False Promise of Imitating Proprietary LLMs
Arnav Gudibande, Eric Wallace, Charles Burton Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song (25 May 2023)

Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models
Natalie Shapira, Mosh Levy, S. Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz (24 May 2023)