Cited By: Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
arXiv:2505.19770 · 26 May 2025
Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, S. S. Du

Papers citing "Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO" (19 of 19 papers shown)

What Makes a Reward Model a Good Teacher? An Optimization Perspective
Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
19 Mar 2025

All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
OffRL · 03 Mar 2025

Distributionally Robust Reinforcement Learning with Human Feedback
Debmalya Mandal, Paulius Sasnauskas, Goran Radanović
01 Mar 2025

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang, Bingcong Li, Christoph Dann, Niao He
OffRL · 26 Feb 2025

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai
OffRL · 20 Feb 2025

Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Ziyi Wang, Hanbin Wang, Wendi Li, ..., Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
OffRL · LRM · 03 Feb 2025

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
11 Oct 2024

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh
10 Oct 2024

The Crucial Role of Samplers in Online Direct Preference Optimization
Ruizhe Shi, Runlong Zhou, Simon S. Du
29 Sep 2024

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Hassan Awadallah, Alexander Rakhlin
31 May 2024

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose H. Blanchet, Zhaoran Wang
26 May 2024

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
16 Apr 2024

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Georgios Tzannetos, Goran Radanović, Adish Singla
04 Mar 2024

Secrets of RLHF in Large Language Models Part II: Reward Modeling
Bing Wang, Rui Zheng, Luyao Chen, Yan Liu, Shihan Dou, ..., Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yuanyuan Jiang
ALM · 11 Jan 2024

A General Theoretical Paradigm to Understand Learning from Human Preferences
M. G. Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
18 Oct 2023

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao
05 Oct 2023

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM · 18 Sep 2019

Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OffRL · 20 Jul 2017

Estimation from Pairwise Comparisons: Sharp Minimax Bounds with Topology Dependence
Nihar B. Shah, Sivaraman Balakrishnan, Joseph K. Bradley, Abhay K. Parekh, Kannan Ramchandran, Martin J. Wainwright
06 May 2015