ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Towards Understanding Sycophancy in Language Models
arXiv:2310.13548 (v4, latest)

20 October 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Papers citing "Towards Understanding Sycophancy in Language Models"

28 / 178 papers shown
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Miles Turpin, Julian Michael, Ethan Perez, Sam Bowman
07 May 2023

Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, ..., Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan
19 Dec 2022

On the Sensitivity of Reward Inference to Misspecified Human Models
Joey Hong, Kush S. Bhatia, Anca Dragan
09 Dec 2022

Fine-tuning language models to find agreement among humans with diverse preferences
Michiel A. Bakker, Martin Chadwick, Hannah R. Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, ..., Nat McAleese, Amelia Glaese, John Aslanides, M. Botvinick, Christopher Summerfield
28 Nov 2022

Measuring Progress on Scalable Oversight for Large Language Models
Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, ..., Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan
04 Nov 2022

Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton
19 Oct 2022

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving
28 Sep 2022

Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning
David Lindner, Mennatallah El-Assady
27 Jun 2022

Self-critiquing models for assisting human evaluators
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike
12 Jun 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan
12 Apr 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou
28 Jan 2022

WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, S. Balaji, Jeff Wu, Long Ouyang, ..., Gretchen Krueger, Kevin Button, Matthew Knight, B. Chess, John Schulman
17 Dec 2021

TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie C. Lin, Jacob Hilton, Owain Evans
08 Sep 2021

Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt
05 Mar 2021

Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
07 Sep 2020

Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning
Rahul Pandey, Hemant Purohit, Carlos Castillo, V. Shalin
07 Jul 2020

Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro
Du Phan, Neeraj Pradhan, M. Jankowiak
24 Dec 2019

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference
Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
23 Jun 2019

Scalable agent alignment via reward modeling: a research direction
Jan Leike, David M. Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
19 Nov 2018

AI safety via debate
G. Irving, Paul Christiano, Dario Amodei
02 May 2018

Occam's razor is insufficient to infer the preferences of irrational agents
Stuart Armstrong, Sören Mindermann
15 Dec 2017

Deep reinforcement learning from human preferences
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
12 Jun 2017

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
Wang Ling, Dani Yogatama, Chris Dyer, Phil Blunsom
11 May 2017

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld, Luke Zettlemoyer
09 May 2017

Learning Mixtures of Plackett-Luce Models
Zhibing Zhao, P. Piech, Lirong Xia
23 Mar 2016

MCMC using Hamiltonian dynamics
Radford M. Neal
09 Jun 2012

The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
Matthew D. Hoffman, Andrew Gelman
18 Nov 2011