Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO

25 May 2020

Firdaus Janoos

Papers citing "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO"


InfoPO: On Mutual Information Maximization for Large Language Model Alignment Teng Xiao Zhen Ge Sujay Sanghavi Tian Wang Julian Katz-Samuels Marc Versage Qingjun Cui Trishul Chilimbi 31 0 0 13 May 2025
Onboard Optimization and Learning: A Survey Monirul Islam Pavel Siyi Hu Mahardhika Pratama Ryszard Kowalczyk 33 0 0 07 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design Miaomiao Ji Yanqiu Wu Zhibin Wu Shoujin Wang Jian Yang Mark Dras Usman Naseem 41 1 0 05 May 2025
Dynamic Action Interpolation: A Universal Approach for Accelerating Reinforcement Learning with Expert Guidance Wenjun Cao 52 0 0 26 Apr 2025
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality Zelei Cheng Xin-Qiang Cai Yuting Tang Pushi Zhang Boming Yang Masashi Sugiyama Xinyu Xing 49 0 0 10 Mar 2025
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters Teng Xiao Yige Yuan Ziyang Chen Mingxiao Li Shangsong Liang Zhaochun Ren V. Honavar 105 6 0 21 Feb 2025
Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope? Michael Doherty Robin Matzner Rasoul Sadeghi Polina Bayvel Alejandra Beghelli 67 0 0 18 Feb 2025
Evolution and The Knightian Blindspot of Machine Learning Joel Lehman Elliot Meyerson Tarek El-Gaaly Kenneth O. Stanley Tarin Ziyaee 99 2 0 22 Jan 2025
RRM: Robust Reward Model Training Mitigates Reward Hacking Tianqi Liu Wei Xiong Jie Jessie Ren Lichang Chen Junru Wu ... Yuan Liu Bilal Piot Abe Ittycheriah Aviral Kumar Mohammad Saleh AAML 56 15 0 20 Sep 2024
From Lists to Emojis: How Format Bias Affects Model Alignment Xuanchang Zhang Wei Xiong Lichang Chen Dinesh Manocha Heng Huang Tong Zhang ALM 37 11 0 18 Sep 2024
Alignment of Diffusion Models: Fundamentals, Challenges, and Future Buhua Liu Shitong Shao Bao Li Lichen Bai Zhiqiang Xu Haoyi Xiong James Kwok Sumi Helal Zeke Xie 49 12 0 11 Sep 2024
Simplifying Deep Temporal Difference Learning Matteo Gallici Mattie Fellows Benjamin Ellis B. Pou Ivan Masmitja Jakob Foerster Mario Martin OffRL 62 17 0 05 Jul 2024
RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning Mingqi Yuan Roger Creus Castanyer Bo Li Xin Jin Glen Berseth Wenjun Zeng 40 0 0 29 May 2024
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer Zhihan Liu Miao Lu Shenao Zhang Boyi Liu Hongyi Guo Yingxiang Yang Jose H. Blanchet Zhaoran Wang 53 43 0 26 May 2024
DPO Meets PPO: Reinforced Token Optimization for RLHF Han Zhong Zikang Shan Guhao Feng Li Zhao Xinle Cheng Di He Jiang Bian Liwei Wang 63 57 0 29 Apr 2024
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards Haoxiang Wang Yong Lin Wei Xiong Rui Yang Shizhe Diao Shuang Qiu Han Zhao Tong Zhang 40 72 0 28 Feb 2024
An Invitation to Deep Reinforcement Learning Bernhard Jaeger Andreas Geiger OffRL OOD 80 5 0 13 Dec 2023
Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing Feiyang Han Yimin Wei Zhaofeng Liu Yanxing Qi 43 1 0 24 Nov 2023
Improving Emotional Expression and Cohesion in Image-Based Playlist Description and Music Topics: A Continuous Parameterization Approach Yuelyu Ji Yuheng Song Wei Wang Ruoyi Xu Zhongqian Xie Huiyun Liu DiffM 43 1 0 02 Oct 2023
Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment Tianhao Wu Banghua Zhu Ruoyu Zhang Zhaojin Wen Kannan Ramchandran Jiantao Jiao 44 55 0 30 Sep 2023
Secrets of RLHF in Large Language Models Part I: PPO Rui Zheng Shihan Dou Songyang Gao Yuan Hua Wei Shen ... Hang Yan Tao Gui Qi Zhang Xipeng Qiu Xuanjing Huang ALM OffRL 55 160 0 11 Jul 2023
Revisiting the Minimalist Approach to Offline Reinforcement Learning Denis Tarasov Vladislav Kurenkov Alexander Nikulin Sergey Kolesnikov OffRL 38 37 0 16 May 2023
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment Hanze Dong Wei Xiong Deepanshu Goyal Yihan Zhang Winnie Chow Rui Pan Shizhe Diao Jipeng Zhang Kashun Shum Tong Zhang ALM 18 410 0 13 Apr 2023
Curiosity-driven Exploration in Sparse-reward Multi-agent Reinforcement Learning Jiong Li Pratik Gajane 39 4 0 21 Feb 2023
Maneuver Decision-Making For Autonomous Air Combat Through Curriculum Learning And Reinforcement Learning With Sparse Rewards Yuxin Wei Hong-Peng Zhang Chang Huang 18 3 0 12 Feb 2023
Joint action loss for proximal policy optimization Xiulei Song Yi-Fan Jin Greg Slabaugh Simon Lucas 21 0 0 26 Jan 2023
Optimal scheduling of island integrated energy systems considering multi-uncertainties and hydrothermal simultaneous transmission: A deep reinforcement learning approach Yang Li Fanjin Bu Yuanzheng Li Chao Long 25 87 0 27 Dec 2022
Deep Black-Box Reinforcement Learning with Movement Primitives Fabian Otto Onur Celik Hongyi Zhou Hanna Ziesche Ngo Anh Vien Gerhard Neumann OffRL 24 19 0 18 Oct 2022
Towards a Standardised Performance Evaluation Protocol for Cooperative MARL R. Gorsane Omayma Mahjoub Ruan de Kock Roland Dubb Siddarth S. Singh Arnu Pretorius OffRL 44 50 0 21 Sep 2022
Understanding reinforcement learned crowds Ariel Kwiatkowski Vicky Kalogeiton Julien Pettré Marie-Paule Cani 24 10 0 19 Sep 2022
Heterogeneous-Agent Mirror Learning: A Continuum of Solutions to Cooperative MARL J. Kuba Xidong Feng Shiyao Ding Hao Dong Jun Wang Yaodong Yang 26 16 0 02 Aug 2022
Post-processing Networks: Method for Optimizing Pipeline Task-oriented Dialogue Systems using Reinforcement Learning Atsumoto Ohashi Ryuichiro Higashinaka OffRL 24 7 0 25 Jul 2022
MetaSlicing: A Novel Resource Allocation Framework for Metaverse N. Chu D. Hoang Diep N. Nguyen Khoa T. Phan E. Dutkiewicz Dusist Niyato Tao Shu 44 46 0 23 May 2022
Combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task Vittorio Giammarino Matthew F. Dunne Kylie N. Moore Michael Hasselmo Chantal E. Stern I. Paschalidis OffRL 39 5 0 11 Mar 2022
Deep Learning Reproducibility and Explainable AI (XAI) Anastasia-Maria Leventi-Peetz T. Östreich 19 9 0 23 Feb 2022
You May Not Need Ratio Clipping in PPO Mingfei Sun Vitaly Kurin Guoqing Liu Sam Devlin Tao Qin Katja Hofmann Shimon Whiteson 18 15 0 31 Jan 2022
Mirror Learning: A Unifying Framework of Policy Optimisation J. Kuba Christian Schroeder de Witt Jakob N. Foerster 29 24 0 07 Jan 2022
A Closer Look at Advantage-Filtered Behavioral Cloning in High-Noise Datasets J. E. Grigsby Yanjun Qi OffRL 34 5 0 10 Oct 2021
Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer Yining Ma Jingwen Li Zhiguang Cao Wen Song Le Zhang Zhenghua Chen Jing Tang 83 130 0 06 Oct 2021
A Pragmatic Look at Deep Imitation Learning Kai Arulkumaran D. Lillrank 35 9 0 04 Aug 2021
Solve routing problems with a residual edge-graph attention neural network Kun Lei Peng Guo Yi Wang Xiao Wu Wenchao Zhao 37 54 0 06 May 2021
Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning Jian Hu Siyang Jiang Seth Austin Harding Haibin Wu Shihua Liao 24 86 0 06 Feb 2021
Differentiable Trust Region Layers for Deep Reinforcement Learning Fabian Otto P. Becker Ngo Anh Vien Hanna Ziesche Gerhard Neumann OffRL 41 19 0 22 Jan 2021
Robust Reinforcement Learning on State Observations with Learned Optimal Adversary Huan Zhang Hongge Chen Duane S. Boning Cho-Jui Hsieh 67 163 0 21 Jan 2021
POPO: Pessimistic Offline Policy Optimization Qiang He Xinwen Hou OffRL 37 10 0 26 Dec 2020
Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations Huan Zhang Hongge Chen Chaowei Xiao Bo-wen Li Mingyan D. Liu Duane S. Boning Cho-Jui Hsieh AAML 47 261 0 19 Mar 2020