Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
Qining Zhang, Lei Ying
arXiv:2409.17401 · 25 September 2024 · [OffRL]
Papers citing "Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference" (50 / 57 papers shown)
1. Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis. Qining Zhang, Honghao Wei, Lei Ying. 11 Jun 2024. [OffRL]
2. Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms. Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit S. Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, S. Niekum. 05 Jun 2024.
3. Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF. Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Hassan Awadallah, Alexander Rakhlin. 31 May 2024.
4. RLHF Workflow: From Reward Modeling to Online RLHF. Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang. 13 May 2024. [OffRL]
5. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark. Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, ..., Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen. 18 Feb 2024.
6. Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization. Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant. 15 Feb 2024.
7. Generalized Preference Optimization: A Unified Approach to Offline Alignment. Yunhao Tang, Z. Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila-Pires, Bilal Piot. 08 Feb 2024.
8. Direct Language Model Alignment from Online AI Feedback. Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, ..., Thomas Mesnard, Yao-Min Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel. 07 Feb 2024. [ALM]
9. KTO: Model Alignment as Prospect Theoretic Optimization. Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela. 02 Feb 2024.
10. Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF. Banghua Zhu, Michael I. Jordan, Jiantao Jiao. 29 Jan 2024.
11. A Survey of Reinforcement Learning from Human Feedback. Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier. 22 Dec 2023. [OffRL]
12. Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint. Wei Xiong, Hanze Dong, Chen Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang. 18 Dec 2023. [OffRL]
13. Making RL with Preference-based Feedback Efficient via Randomization. Runzhe Wu, Wen Sun. 23 Oct 2023.
14. A General Theoretical Paradigm to Understand Learning from Human Preferences. M. G. Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos. 18 Oct 2023.
15. PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback. Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang. 03 Aug 2023.
16. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell. 27 Jul 2023. [ALM, OffRL]
17. Is RLHF More Difficult than Standard RL? Yuanhao Wang, Qinghua Liu, Chi Jin. 25 Jun 2023. [OffRL]
18. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. 29 May 2023. [ALM]
19. Provable Reward-Agnostic Preference-Based Reinforcement Learning. Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee. 29 May 2023.
20. Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism. Zihao Li, Zhuoran Yang, Mengdi Wang. 29 May 2023. [OffRL]
21. Fine-Tuning Language Models with Just Forward Passes. Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alexandru Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora. 27 May 2023.
22. Provable Offline Preference-Based Reinforcement Learning. Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun. 24 May 2023. [OffRL]
23. SLiC-HF: Sequence Likelihood Calibration with Human Feedback. Yao-Min Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, Peter J. Liu. 17 May 2023.
24. Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning. Dingwen Kong, Lin F. Yang. 18 Apr 2023.
25. Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles. Zhiwei Tang, Dmitry Rybin, Tsung-Hui Chang. 07 Mar 2023. [ALM, DiffM]
26. Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons. Banghua Zhu, Jiantao Jiao, Michael I. Jordan. 26 Jan 2023. [OffRL]
27. Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation. Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang. 23 May 2022.
28. Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. 04 Mar 2022. [OSLM, ALM]
29. WebGPT: Browser-assisted question-answering with human feedback. Reiichiro Nakano, Jacob Hilton, S. Balaji, Jeff Wu, Ouyang Long, ..., Gretchen Krueger, Kevin Button, Matthew Knight, B. Chess, John Schulman. 17 Dec 2021. [ALM, RALM]
30. Dueling RL: Reinforcement Learning with Trajectory Preferences. Aldo Pacchiano, Aadirupa Saha, Jonathan Lee. 08 Nov 2021.
31. Recursively Summarizing Books with Human Feedback. Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nissan Stiennon, Ryan J. Lowe, Jan Leike, Paul Christiano. 22 Sep 2021. [ALM]
32. Reward is enough for convex MDPs. Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, Satinder Singh. 01 Jun 2021.
33. A Zeroth-Order Block Coordinate Descent Algorithm for Huge-Scale Black-Box Optimization. HanQin Cai, Y. Lou, Daniel McKenzie, W. Yin. 21 Feb 2021.
34. Learning to summarize from human feedback. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. 02 Sep 2020. [ALM]
35. Variational Policy Gradient Method for Reinforcement Learning with General Utilities. Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvári, Mengdi Wang. 04 Jul 2020.
36. Preference-based Reinforcement Learning with Finite-Time Guarantees. Yichong Xu, Ruosong Wang, Lin F. Yang, Aarti Singh, A. Dubrawski. 16 Jun 2020.
37. Deep Reinforcement Learning for Autonomous Driving: A Survey. B. R. Kiran, Ibrahim Sobh, V. Talpaert, Patrick Mannion, A. A. Sallab, S. Yogamani, P. Pérez. 02 Feb 2020.
38. Fine-Tuning Language Models from Human Preferences. Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving. 18 Sep 2019. [ALM]
39. Dueling Posterior Sampling for Preference-Based Reinforcement Learning. Ellen R. Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, J. W. Burdick. 04 Aug 2019.
40. On the Convergence of Adam and Beyond. Sashank J. Reddi, Satyen Kale, Surinder Kumar. 19 Apr 2019.
41. Provably Efficient Maximum Entropy Exploration. Elad Hazan, Sham Kakade, Karan Singh, A. V. Soest. 06 Dec 2018.
42. Preference-based Online Learning with Dueling Bandits: A Survey. Viktor Bengs, R. Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier. 30 Jul 2018.
43. A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress. Saurabh Arora, Prashant Doshi. 18 Jun 2018. [OffRL]
44. Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization. Sijia Liu, B. Kailkhura, Pin-Yu Chen, Pai-Shun Ting, Shiyu Chang, Lisa Amini. 25 May 2018.
45. signSGD: Compressed Optimisation for Non-Convex Problems. Jeremy Bernstein, Yu Wang, Kamyar Azizzadenesheli, Anima Anandkumar. 13 Feb 2018. [FedML, ODL]
46. Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents. Edoardo Conti, Vashisht Madhavan, F. Such, Joel Lehman, Kenneth O. Stanley, Jeff Clune. 18 Dec 2017.
47. Regret Analysis for Continuous Dueling Bandit. Wataru Kumagai. 21 Nov 2017.
48. Zeroth-Order Online Alternating Direction Method of Multipliers: Convergence Analysis and Applications. Sijia Liu, Jie Chen, Pin-Yu Chen, Alfred Hero. 21 Oct 2017.
49. Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces. Garrett A. Warnell, Nicholas R. Waytowich, Vernon J. Lawhern, Peter Stone. 28 Sep 2017.
50. Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. 20 Jul 2017. [OffRL]