ResearchTrend.AI
Proximal Policy Optimization Algorithms

20 July 2017
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
    OffRL

Papers citing "Proximal Policy Optimization Algorithms"

50 / 8,601 papers shown
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
Yiqing Liang
Jielin Qiu
Wenhao Ding
Zuxin Liu
James Tompkin
Mengdi Xu
Mengzhou Xia
Zhengzhong Tu
Laixi Shi
Jiacheng Zhu
OffRL
125
0
0
30 May 2025
SignBot: Learning Human-to-Humanoid Sign Language Interaction
Guanren Qiao
Sixu Lin
Ronglai Zuo
Zhizheng Wu
Kui Jia
Guiliang Liu
SLR
53
0
0
30 May 2025
How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning
Hongyi Cai
Junlin Wang
Xiaoyin Chen
Bhuwan Dhingra
LRM
19
0
0
30 May 2025
Compiler-R1: Towards Agentic Compiler Auto-tuning with Reinforcement Learning
Haolin Pan
Hongyu Lin
Haoran Luo
Yang Liu
Kaichun Yao
Libo Zhang
Mingjie Xing
Yanjun Wu
OffRL, LRM
5
0
0
30 May 2025
RAST: Reasoning Activation in LLMs via Small-model Transfer
Siru Ouyang
Xinyu Zhu
Zilin Xiao
Minhao Jiang
Yu Meng
Jiawei Han
OffRL, ReLM, LRM
20
0
0
30 May 2025
Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation
Ximing Xing
Yandong Guan
Jing Zhang
Dong Xu
Qian Yu
LRM
64
0
0
30 May 2025
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Shelly Bensal
Umar Jamil
Christopher Bryant
M. Russak
Kiran Kamble
Dmytro Mozolevskyi
Muayad Ali
Waseem Alshikh
LLMAG, ReLM, LRM
23
0
0
30 May 2025
Autonomous Behavior and Whole-Brain Dynamics Emerge in Embodied Zebrafish Agents with Model-based Intrinsic Motivation
Reece Keller
Alyn Tornell
Felix Pei
Xaq Pitkow
Leo Kozachkov
Aran Nayebi
15
0
0
30 May 2025
A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming
Yizhong Ding
AAML
15
0
0
30 May 2025
Reinforcing Video Reasoning with Focused Thinking
Jisheng Dang
Jingze Wu
T. Wang
Xuanhui Lin
Nannan Zhu
Hongbo Chen
Wei-Shi Zheng
Meng Wang
Tat-Seng Chua
OffRL, LRM
31
0
0
30 May 2025
Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation
Ahmed Elhady
Eneko Agirre
Mikel Artetxe
CLL, KELM, ELM
24
0
0
30 May 2025
Proactive Guidance of Multi-Turn Conversation in Industrial Search
Xiaoyu Li
Xiao Li
Li Gao
Yiding Liu
Xiaoyang Wang
Shuaiqiang Wang
Junfeng Wang
Dawei Yin
LLMAG
24
0
0
30 May 2025
Navigation of a Three-Link Microswimmer via Deep Reinforcement Learning
Yuyang Lai
Sina Heydari
On Shun Pak
Yi Man
15
0
0
30 May 2025
Multiple LLM Agents Debate for Equitable Cultural Alignment
Dayeon Ki
Rachel Rudinger
Tianyi Zhou
Marine Carpuat
LLMAG
21
0
0
30 May 2025
Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards
Xun Lu
Yunyi Yang
Yongbo Gai
Kai Luo
Shihao Huang
Jianhe Lin
Xiaoxi Jiang
Guanjun Jiang
31
0
0
30 May 2025
ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving
Yongming Chen
Miner Chen
Liewen Liao
Mingyang Jiang
Xiang Zuo
Hengrui Zhang
Yuchen Xi
Songan Zhang
27
0
0
30 May 2025
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu
Jiaxuan Gao
Xujie Shen
Chen Zhu
Zhiyu Mei
...
Jun Mei
Jiashu Wang
Tongkai Yang
Binhang Yuan
Yi Wu
OffRL, SyDa, LRM
53
0
0
30 May 2025
Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
Paul Gölz
Nika Haghtalab
Kunhe Yang
32
0
0
29 May 2025
Diffusion Guidance Is a Controllable Policy Improvement Operator
Kevin Frans
Seohong Park
Pieter Abbeel
Sergey Levine
OffRL
62
0
0
29 May 2025
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Zeyu Liu
Y. Liu
Guanghao Zhu
C. Xie
Zhen Li
...
Qing Li
Shing-Chi Cheung
Shengyu Zhang
Fei Wu
Hongxia Yang
ReLM, LRM
71
0
0
29 May 2025
Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation
Caiqi Zhang
Xiaochen Zhu
Chengzu Li
Nigel Collier
Andreas Vlachos
OffRL, HILM
42
1
0
29 May 2025
Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Yiran Guo
Lijie Xu
Jie Liu
Dan Ye
Shuang Qiu
OffRL
80
0
0
29 May 2025
Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
Aneeshan Sain
Subhajit Maity
Pinaki Nath Chowdhury
Subhadeep Koley
A. Bhunia
Yi-Zhe Song
3DH
59
0
0
29 May 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Chenyu Yang
Shiqian Su
Shi-Qi Liu
Xuan Dong
Yue Yu
...
Hao Li
Wenhai Wang
Yu Qiao
Xizhou Zhu
Jifeng Dai
OffRL
142
0
0
29 May 2025
Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Kaiyang Guo
Yinchuan Li
Zhitang Chen
50
0
0
29 May 2025
Composite Reward Design in PPO-Driven Adaptive Filtering
Abdullah Burkan Bereketoglu
13
0
0
29 May 2025
Accelerating RLHF Training with Reward Variance Increase
Zonglin Yang
Zhexuan Gu
Houduo Qi
Yancheng Yuan
79
0
0
29 May 2025
AMOR: Adaptive Character Control through Multi-Objective Reinforcement Learning
Lucas N. Alegre
Agon Serifi
Ruben Grandia
David Müller
Espen Knoop
Moritz Bächer
48
0
0
29 May 2025
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Chenbin Pan
Wenbin He
Zhengzhong Tu
Liu Ren
LRM, VLM
58
0
0
29 May 2025
Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability
Ruida Wang
Yuxin Li
Yi R. Fung
LRM
82
1
0
29 May 2025
Learning coordinated badminton skills for legged manipulators
Yuntao Ma
Andrei Cramariuc
Farbod Farshidian
Marco Hutter
37
1
0
29 May 2025
LocoTouch: Learning Dexterous Quadrupedal Transport with Tactile Sensing
Changyi Lin
Yuxin Ray Song
Boda Huo
Mingyang Yu
Yikai Wang
...
Wenhao Yu
Tingnan Zhang
Jie Tan
Yiyue Luo
Ding Zhao
44
0
0
29 May 2025
Grower-in-the-Loop Interactive Reinforcement Learning for Greenhouse Climate Control
Maxiu Xiao
Jianglin Lan
Jingxing Yu
Eldert van Henten
OffRL, AI4CE
49
0
0
29 May 2025
Augment or Not? A Comparative Study of Pure and Augmented Large Language Model Recommenders
Wei-Hsiang Huang
Chen-Wei Ke
Wei-Ning Chiu
Yu-Xuan Su
Chun-Chun Yang
Chieh-Yuan Cheng
Yun-Nung Chen
Pu-Jen Cheng
74
0
0
29 May 2025
Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models
Lang Cao
Jingxian Xu
Hanbing Liu
Jinyu Wang
Mengyu Zhou
Haoyu Dong
Shi Han
Dongmei Zhang
LRM, OffRL, LMTD, ReLM
58
0
0
29 May 2025
Enhanced DACER Algorithm with High Diffusion Efficiency
Yinuo Wang
Mining Tan
Wenjun Zou
Haotian Lin
Xujie Song
...
Guojian Zhan
Tianze Zhu
Shiqi Liu
Jingliang Duan
Shengbo Eben Li
DiffM
70
0
0
29 May 2025
DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
Sungjune Park
Hyunjun Kim
Junho Kim
S. T. Kim
Y. Ro
LRM
123
0
0
29 May 2025
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
Guangchen Lan
Huseyin A. Inan
Sahar Abdelnabi
Janardhan Kulkarni
Lukas Wutschitz
Reza Shokri
Christopher G. Brinton
Robert Sim
12
1
0
29 May 2025
Learning to Search for Vehicle Routing with Multiple Time Windows
Kuan Xu
Zhiguang Cao
Chenlong Zheng
Linong Liu
12
0
0
29 May 2025
ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
Peixuan Han
Zijia Liu
Jiaxuan You
LLMAG
87
0
0
29 May 2025
Discriminative Policy Optimization for Token-Level Reward Models
Hongzhan Chen
Tao Yang
Shiping Gao
Ruijun Chen
Xiaojun Quan
Hongtao Tian
Ting Yao
33
0
0
29 May 2025
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective
Sheng Ouyang
Yulan Hu
Ge Chen
Qingyang Li
Fuzheng Zhang
Yong Liu
27
0
0
29 May 2025
Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization
Matteo Gallici
Haitz Sáez de Ocáriz Borde
32
0
0
29 May 2025
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
Zhitao He
Sandeep Polisetty
Zhiyuan Fan
Yuchen Huang
Shujin Wu
Yi R.
LRM
64
2
0
29 May 2025
Learning Parametric Distributions from Samples and Preferences
Marc Jourdan
Gizem Yüce
Nicolas Flammarion
15
0
0
29 May 2025
Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data
Seohyeong Lee
Eunwon Kim
Hwaran Lee
Buru Chang
61
0
0
29 May 2025
ROTATE: Regret-driven Open-ended Training for Ad Hoc Teamwork
Caroline Wang
Arrasy Rahman
Jiaxun Cui
Yoonchang Sung
Peter Stone
58
0
0
29 May 2025
Training Language Models to Generate Quality Code with Program Analysis Feedback
Feng Yao
Zilong Wang
Liyuan Liu
Junxia Cui
Li Zhong
Xiaohan Fu
Haohui Mai
Vish Krishnan
Jianfeng Gao
Jingbo Shang
52
0
0
28 May 2025
Reward-Independent Messaging for Decentralized Multi-Agent Reinforcement Learning
Naoto Yoshida
Tadahiro Taniguchi
22
0
0
28 May 2025
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition
Hanting Chen
Yasheng Wang
Kai Han
Dong Li
Lin Li
...
Hailin Hu
Yehui Tang
Dacheng Tao
Xinghao Chen
Yunhe Wang
LRM
93
0
0
28 May 2025