2005.12729
Cited By
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
25 May 2020
Logan Engstrom
Andrew Ilyas
Shibani Santurkar
Dimitris Tsipras
Firdaus Janoos
L. Rudolph
A. Madry
AAML
Papers citing "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO"
46 / 46 papers shown
InfoPO: On Mutual Information Maximization for Large Language Model Alignment
Teng Xiao
Zhen Ge
Sujay Sanghavi
Tian Wang
Julian Katz-Samuels
Marc Versage
Qingjun Cui
Trishul Chilimbi
31
0
0
13 May 2025
Onboard Optimization and Learning: A Survey
Monirul Islam Pavel
Siyi Hu
Mahardhika Pratama
Ryszard Kowalczyk
33
0
0
07 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
41
1
0
05 May 2025
Dynamic Action Interpolation: A Universal Approach for Accelerating Reinforcement Learning with Expert Guidance
Wenjun Cao
52
0
0
26 Apr 2025
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
Zelei Cheng
Xin-Qiang Cai
Yuting Tang
Pushi Zhang
Boming Yang
Masashi Sugiyama
Xinyu Xing
49
0
0
10 Mar 2025
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao
Yige Yuan
Ziyang Chen
Mingxiao Li
Shangsong Liang
Zhaochun Ren
V. Honavar
105
6
0
21 Feb 2025
Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope?
Michael Doherty
Robin Matzner
Rasoul Sadeghi
Polina Bayvel
Alejandra Beghelli
67
0
0
18 Feb 2025
Evolution and The Knightian Blindspot of Machine Learning
Joel Lehman
Elliot Meyerson
Tarek El-Gaaly
Kenneth O. Stanley
Tarin Ziyaee
99
2
0
22 Jan 2025
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu
Wei Xiong
Jie Jessie Ren
Lichang Chen
Junru Wu
...
Yuan Liu
Bilal Piot
Abe Ittycheriah
Aviral Kumar
Mohammad Saleh
AAML
56
15
0
20 Sep 2024
From Lists to Emojis: How Format Bias Affects Model Alignment
Xuanchang Zhang
Wei Xiong
Lichang Chen
Dinesh Manocha
Heng Huang
Tong Zhang
ALM
37
11
0
18 Sep 2024
Alignment of Diffusion Models: Fundamentals, Challenges, and Future
Buhua Liu
Shitong Shao
Bao Li
Lichen Bai
Zhiqiang Xu
Haoyi Xiong
James Kwok
Sumi Helal
Zeke Xie
49
12
0
11 Sep 2024
Simplifying Deep Temporal Difference Learning
Matteo Gallici
Mattie Fellows
Benjamin Ellis
B. Pou
Ivan Masmitja
Jakob Foerster
Mario Martin
OffRL
62
17
0
05 Jul 2024
RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning
Mingqi Yuan
Roger Creus Castanyer
Bo Li
Xin Jin
Glen Berseth
Wenjun Zeng
40
0
0
29 May 2024
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu
Miao Lu
Shenao Zhang
Boyi Liu
Hongyi Guo
Yingxiang Yang
Jose H. Blanchet
Zhaoran Wang
53
43
0
26 May 2024
DPO Meets PPO: Reinforced Token Optimization for RLHF
Han Zhong
Zikang Shan
Guhao Feng
Li Zhao
Xinle Cheng
Di He
Jiang Bian
Liwei Wang
63
57
0
29 Apr 2024
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Haoxiang Wang
Yong Lin
Wei Xiong
Rui Yang
Shizhe Diao
Shuang Qiu
Han Zhao
Tong Zhang
40
72
0
28 Feb 2024
An Invitation to Deep Reinforcement Learning
Bernhard Jaeger
Andreas Geiger
OffRL
OOD
80
5
0
13 Dec 2023
Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing
Feiyang Han
Yimin Wei
Zhaofeng Liu
Yanxing Qi
43
1
0
24 Nov 2023
Improving Emotional Expression and Cohesion in Image-Based Playlist Description and Music Topics: A Continuous Parameterization Approach
Yuelyu Ji
Yuheng Song
Wei Wang
Ruoyi Xu
Zhongqian Xie
Huiyun Liu
DiffM
43
1
0
02 Oct 2023
Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment
Tianhao Wu
Banghua Zhu
Ruoyu Zhang
Zhaojin Wen
Kannan Ramchandran
Jiantao Jiao
44
55
0
30 Sep 2023
Secrets of RLHF in Large Language Models Part I: PPO
Rui Zheng
Shihan Dou
Songyang Gao
Yuan Hua
Wei Shen
...
Hang Yan
Tao Gui
Qi Zhang
Xipeng Qiu
Xuanjing Huang
ALM
OffRL
55
160
0
11 Jul 2023
Revisiting the Minimalist Approach to Offline Reinforcement Learning
Denis Tarasov
Vladislav Kurenkov
Alexander Nikulin
Sergey Kolesnikov
OffRL
38
37
0
16 May 2023
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong
Wei Xiong
Deepanshu Goyal
Yihan Zhang
Winnie Chow
Rui Pan
Shizhe Diao
Jipeng Zhang
Kashun Shum
Tong Zhang
ALM
18
410
0
13 Apr 2023
Curiosity-driven Exploration in Sparse-reward Multi-agent Reinforcement Learning
Jiong Li
Pratik Gajane
39
4
0
21 Feb 2023
Maneuver Decision-Making For Autonomous Air Combat Through Curriculum Learning And Reinforcement Learning With Sparse Rewards
Yuxin Wei
Hong-Peng Zhang
Chang Huang
18
3
0
12 Feb 2023
Joint action loss for proximal policy optimization
Xiulei Song
Yi-Fan Jin
Greg Slabaugh
Simon Lucas
21
0
0
26 Jan 2023
Optimal scheduling of island integrated energy systems considering multi-uncertainties and hydrothermal simultaneous transmission: A deep reinforcement learning approach
Yang Li
Fanjin Bu
Yuanzheng Li
Chao Long
25
87
0
27 Dec 2022
Deep Black-Box Reinforcement Learning with Movement Primitives
Fabian Otto
Onur Celik
Hongyi Zhou
Hanna Ziesche
Ngo Anh Vien
Gerhard Neumann
OffRL
24
19
0
18 Oct 2022
Towards a Standardised Performance Evaluation Protocol for Cooperative MARL
R. Gorsane
Omayma Mahjoub
Ruan de Kock
Roland Dubb
Siddarth S. Singh
Arnu Pretorius
OffRL
44
50
0
21 Sep 2022
Understanding reinforcement learned crowds
Ariel Kwiatkowski
Vicky Kalogeiton
Julien Pettré
Marie-Paule Cani
24
10
0
19 Sep 2022
Heterogeneous-Agent Mirror Learning: A Continuum of Solutions to Cooperative MARL
J. Kuba
Xidong Feng
Shiyao Ding
Hao Dong
Jun Wang
Yaodong Yang
26
16
0
02 Aug 2022
Post-processing Networks: Method for Optimizing Pipeline Task-oriented Dialogue Systems using Reinforcement Learning
Atsumoto Ohashi
Ryuichiro Higashinaka
OffRL
24
7
0
25 Jul 2022
MetaSlicing: A Novel Resource Allocation Framework for Metaverse
N. Chu
D. Hoang
Diep N. Nguyen
Khoa T. Phan
E. Dutkiewicz
Dusit Niyato
Tao Shu
44
46
0
23 May 2022
Combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task
Vittorio Giammarino
Matthew F. Dunne
Kylie N. Moore
Michael Hasselmo
Chantal E. Stern
I. Paschalidis
OffRL
39
5
0
11 Mar 2022
Deep Learning Reproducibility and Explainable AI (XAI)
Anastasia-Maria Leventi-Peetz
T. Östreich
19
9
0
23 Feb 2022
You May Not Need Ratio Clipping in PPO
Mingfei Sun
Vitaly Kurin
Guoqing Liu
Sam Devlin
Tao Qin
Katja Hofmann
Shimon Whiteson
18
15
0
31 Jan 2022
Mirror Learning: A Unifying Framework of Policy Optimisation
J. Kuba
Christian Schroeder de Witt
Jakob N. Foerster
29
24
0
07 Jan 2022
A Closer Look at Advantage-Filtered Behavioral Cloning in High-Noise Datasets
J. E. Grigsby
Yanjun Qi
OffRL
34
5
0
10 Oct 2021
Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer
Yining Ma
Jingwen Li
Zhiguang Cao
Wen Song
Le Zhang
Zhenghua Chen
Jing Tang
83
130
0
06 Oct 2021
A Pragmatic Look at Deep Imitation Learning
Kai Arulkumaran
D. Lillrank
35
9
0
04 Aug 2021
Solve routing problems with a residual edge-graph attention neural network
Kun Lei
Peng Guo
Yi Wang
Xiao Wu
Wenchao Zhao
37
54
0
06 May 2021
Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning
Jian Hu
Siyang Jiang
Seth Austin Harding
Haibin Wu
Shihua Liao
24
86
0
06 Feb 2021
Differentiable Trust Region Layers for Deep Reinforcement Learning
Fabian Otto
P. Becker
Ngo Anh Vien
Hanna Ziesche
Gerhard Neumann
OffRL
41
19
0
22 Jan 2021
Robust Reinforcement Learning on State Observations with Learned Optimal Adversary
Huan Zhang
Hongge Chen
Duane S. Boning
Cho-Jui Hsieh
67
163
0
21 Jan 2021
POPO: Pessimistic Offline Policy Optimization
Qiang He
Xinwen Hou
OffRL
37
10
0
26 Dec 2020
Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations
Huan Zhang
Hongge Chen
Chaowei Xiao
Bo-wen Li
Mingyan D. Liu
Duane S. Boning
Cho-Jui Hsieh
AAML
47
261
0
19 Mar 2020