arXiv:2404.19409
Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning
30 April 2024
Mathieu Rita
Florian Strub
Rahma Chaabouni
Paul Michel
Emmanuel Dupoux
Olivier Pietquin
Papers citing
"Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning"
15 / 15 papers shown
Title
Language Model Alignment with Elastic Reset
Michael Noukhovitch
Samuel Lavoie
Florian Strub
Aaron Courville
KELM
125
26
0
06 Dec 2023
A General Theoretical Paradigm to Understand Learning from Human Preferences
M. G. Azar
Mark Rowland
Bilal Piot
Daniel Guo
Daniele Calandriello
Michal Valko
Rémi Munos
163
613
0
18 Oct 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
53
149
0
07 Jun 2023
Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
Rajkumar Ramamurthy
Prithviraj Ammanabrolu
Kianté Brantley
Jack Hessel
R. Sifa
Christian Bauckhage
Hannaneh Hajishirzi
Yejin Choi
OffRL
82
246
0
03 Oct 2022
Countering Language Drift with Seeded Iterated Learning
Yuchen Lu
Soumye Singhal
Florian Strub
Olivier Pietquin
Aaron Courville
62
76
0
28 Mar 2020
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh
Lysandre Debut
Julien Chaumond
Thomas Wolf
192
7,465
0
02 Oct 2019
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
449
1,717
0
18 Sep 2019
Training language GANs from Scratch
Cyprien de Masson d'Autume
Mihaela Rosca
Jack W. Rae
S. Mohamed
GAN
SyDa
33
87
0
23 May 2019
Overcoming Exploration in Reinforcement Learning with Demonstrations
Ashvin Nair
Bob McGrew
Marcin Andrychowicz
Wojciech Zaremba
Pieter Abbeel
OffRL
86
783
0
28 Sep 2017
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Aravind Rajeswaran
Vikash Kumar
Abhishek Gupta
Giulia Vezzani
John Schulman
E. Todorov
Sergey Levine
126
1,093
0
28 Sep 2017
Proximal Policy Optimization Algorithms
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
OffRL
444
18,931
0
20 Jul 2017
Deal or No Deal? End-to-End Learning for Negotiation Dialogues
M. Lewis
Denis Yarats
Yann N. Dauphin
Devi Parikh
Dhruv Batra
LLMAG
65
413
0
16 Jun 2017
A Deep Reinforced Model for Abstractive Summarization
Romain Paulus
Caiming Xiong
R. Socher
AI4TS
175
1,556
0
11 May 2017
Sequence Level Training with Recurrent Neural Networks
Marc'Aurelio Ranzato
S. Chopra
Michael Auli
Wojciech Zaremba
96
1,614
0
20 Nov 2015
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Samy Bengio
Oriol Vinyals
Navdeep Jaitly
Noam M. Shazeer
133
2,031
0
09 Jun 2015