Towards Understanding Sycophancy in Language Models
arXiv:2310.13548 · 20 October 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Papers citing "Towards Understanding Sycophancy in Language Models"
28 of 178 papers shown.
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Miles Turpin, Julian Michael, Ethan Perez, Sam Bowman. 07 May 2023.
Discovering Language Model Behaviors with Model-Written Evaluations. Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, ..., Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan. 19 Dec 2022.
On the Sensitivity of Reward Inference to Misspecified Human Models. Joey Hong, Kush S. Bhatia, Anca Dragan. 09 Dec 2022.
Fine-tuning language models to find agreement among humans with diverse preferences. Michiel A. Bakker, Martin Chadwick, Hannah R. Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, ..., Nat McAleese, Amelia Glaese, John Aslanides, M. Botvinick, Christopher Summerfield. 28 Nov 2022.
Measuring Progress on Scalable Oversight for Large Language Models. Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, ..., Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan. 04 Nov 2022.
Scaling Laws for Reward Model Overoptimization. Leo Gao, John Schulman, Jacob Hilton. 19 Oct 2022.
Improving alignment of dialogue agents via targeted human judgements. Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving. 28 Sep 2022.
Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning. David Lindner, Mennatallah El-Assady. 27 Jun 2022.
Self-critiquing models for assisting human evaluators. William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike. 12 Jun 2022.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan. 12 Apr 2022.
Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. 04 Mar 2022.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou. 28 Jan 2022.
WebGPT: Browser-assisted question-answering with human feedback. Reiichiro Nakano, Jacob Hilton, S. Balaji, Jeff Wu, Long Ouyang, ..., Gretchen Krueger, Kevin Button, Matthew Knight, B. Chess, John Schulman. 17 Dec 2021.
TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie C. Lin, Jacob Hilton, Owain Evans. 08 Sep 2021.
Measuring Mathematical Problem Solving With the MATH Dataset. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. 05 Mar 2021.
Measuring Massive Multitask Language Understanding. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 07 Sep 2020.
Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning. Rahul Pandey, Hemant Purohit, Carlos Castillo, V. Shalin. 07 Jul 2020.
Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro. Du Phan, Neeraj Pradhan, M. Jankowiak. 24 Dec 2019.
On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference. Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan. 23 Jun 2019.
Scalable agent alignment via reward modeling: a research direction. Jan Leike, David M. Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg. 19 Nov 2018.
AI safety via debate. G. Irving, Paul Christiano, Dario Amodei. 02 May 2018.
Occam's razor is insufficient to infer the preferences of irrational agents. Stuart Armstrong, Sören Mindermann. 15 Dec 2017.
Deep reinforcement learning from human preferences. Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. 12 Jun 2017.
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. Wang Ling, Dani Yogatama, Chris Dyer, Phil Blunsom. 11 May 2017.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Mandar Joshi, Eunsol Choi, Daniel S. Weld, Luke Zettlemoyer. 09 May 2017.
Learning Mixtures of Plackett-Luce Models. Zhibing Zhao, P. Piech, Lirong Xia. 23 Mar 2016.
MCMC using Hamiltonian dynamics. Radford M. Neal. 09 Jun 2012.
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Matthew D. Hoffman, Andrew Gelman. 18 Nov 2011.