2505.23433
Cited By
Diversity-Aware Policy Optimization for Large Language Model Reasoning
29 May 2025
Jian Yao
Ran Cheng
Xingyu Wu
Jibin Wu
Kay Chen Tan
Author Contacts:
nigel97.yao@connect.polyu.hk
ran-peter.cheng@polyu.edu.hk
xingy.wu@polyu.edu.hk
jibin.wu@polyu.edu.hk
kctan@polyu.edu.hk
LRM
Papers citing
"Diversity-Aware Policy Optimization for Large Language Model Reasoning"
41 / 41 papers shown
Title
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo
Kaiyan Zhang
Li Sheng
Xuekai Zhu
...
Youbang Sun
Zhiyuan Ma
Lifan Yuan
Ning Ding
Bowen Zhou
OffRL
380
26
0
22 Apr 2025
SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
Xinyu Zhang
Jiadong Wang
Zifei Cheng
Wenhao Zhuang
Zheng Lin
...
Shouyu Yin
Chaohang Wen
Haotian Zhang
Bin Chen
Bing Yu
LRM
138
9
0
19 Apr 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue
Zhiqi Chen
Rui Lu
Andrew Zhao
Zhaokai Wang
Yang Yue
Shiji Song
Gao Huang
ReLM
LRM
184
93
0
18 Apr 2025
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Yu Yue
Yufeng Yuan
Qiying Yu
Xiaochen Zuo
Ruofei Zhu
...
Ru Zhang
Xin Liu
Mingxuan Wang
Yonghui Wu
Lin Yan
OffRL
LRM
112
30
0
07 Apr 2025
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu
Changyu Chen
Wenjun Li
Penghui Qi
Tianyu Pang
Chao Du
Wee Sun Lee
Min Lin
OffRL
LRM
190
141
0
26 Mar 2025
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Weihao Zeng
Yuzhen Huang
Qian Liu
Wei Liu
Keqing He
Zejun Ma
Junxian He
OffRL
ReLM
LRM
160
109
0
24 Mar 2025
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu
Zheng Zhang
Ruofei Zhu
Yufeng Yuan
Xiaochen Zuo
...
Ya Zhang
Lin Yan
Mu Qiao
Yonghui Wu
Mingxuan Wang
OffRL
LRM
195
175
0
18 Mar 2025
What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret
Yufeng Yuan
Yu Yue
Ruofei Zhu
Tiantian Fan
Lin Yan
OffRL
97
19
0
03 Mar 2025
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
Alon Albalak
Duy Phung
Nathan Lile
Rafael Rafailov
Kanishk Gandhi
...
Anikait Singh
Chase Blagden
Violet Xiang
Dakota Mahan
Nick Haber
OffRL
LRM
86
12
0
24 Feb 2025
Process Reinforcement through Implicit Rewards
Ganqu Cui
Lifan Yuan
Ziyi Wang
Hanbin Wang
Wendi Li
...
Yu Cheng
Zhiyuan Liu
Maosong Sun
Bowen Zhou
Ning Ding
OffRL
LRM
137
92
0
03 Feb 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
ReLM
VLM
OffRL
AI4TS
LRM
370
1,692
0
22 Jan 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team
Angang Du
Bofei Gao
Bowei Xing
Changjiu Jiang
...
Zihao Huang
Ziyao Xu
Zhiyong Yang
Zonghan Yang
Zongyu Lin
OffRL
ALM
AI4TS
VLM
LRM
248
274
0
22 Jan 2025
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Weihao Zeng
Yuzhen Huang
Lulu Zhao
Yijun Wang
Zifei Shan
Junxian He
LRM
122
15
0
23 Dec 2024
One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity
Sonia K. Murthy
Tomer Ullman
Jennifer Hu
ALM
90
13
0
07 Nov 2024
VinePPO: Refining Credit Assignment in RL Training of LLMs
Amirhossein Kazemnejad
Milad Aghajohari
Eva Portelance
Alessandro Sordoni
Siva Reddy
Rameswar Panda
Nicolas Le Roux
OffRL
LRM
70
36
0
02 Oct 2024
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
An Yang
Beichen Zhang
Binyuan Hui
Bofei Gao
Bowen Yu
...
Mingfeng Xue
Runji Lin
Tianyu Liu
Xingzhang Ren
Zhenru Zhang
OSLM
LRM
98
287
0
18 Sep 2024
Progress or Regress? Self-Improvement Reversal in Post-training
Ting Wu
Xuefeng Li
Pengfei Liu
LRM
74
13
0
06 Jul 2024
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Zhengyang Tang
Xingxing Zhang
Benyou Wang
Furu Wei
ALM
LRM
72
73
0
05 Mar 2024
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Chaoqun He
Renjie Luo
Yuzhuo Bai
Shengding Hu
Zhen Leng Thai
...
Yuxiang Zhang
Jie Liu
Lei Qi
Zhiyuan Liu
Maosong Sun
ELM
AIMat
106
273
0
21 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
138
1,119
0
05 Feb 2024
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Peiyi Wang
Lei Li
Zhihong Shao
R. X. Xu
Damai Dai
Yifei Li
Deli Chen
Y. Wu
Zhifang Sui
AIMat
LRM
ALM
132
391
0
14 Dec 2023
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk
Ishita Mediratta
Christoforos Nalmpantis
Jelena Luketina
Eric Hambro
Edward Grefenstette
Roberta Raileanu
AI4CE
ALM
161
148
0
10 Oct 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon
Zhuohan Li
Siyuan Zhuang
Ying Sheng
Lianmin Zheng
Cody Hao Yu
Joseph E. Gonzalez
Haotong Zhang
Ion Stoica
VLM
188
2,223
0
12 Sep 2023
Let's Verify Step by Step
Hunter Lightman
V. Kosaraju
Yura Burda
Harrison Edwards
Bowen Baker
Teddy Lee
Jan Leike
John Schulman
Ilya Sutskever
K. Cobbe
ALM
OffRL
LRM
191
1,164
0
31 May 2023
Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning
Sumeet Batra
Bryon Tjanaka
Matthew C. Fontaine
Aleksei Petrenko
Stefanos Nikolaidis
Gaurav Sukhatme
OffRL
73
17
0
23 May 2023
Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization
Zihan Zhou
Wei Fu
Bingliang Zhang
Yi Wu
65
30
0
04 Apr 2022
Approximating Gradients for Differentiable Quality Diversity in Reinforcement Learning
Bryon Tjanaka
Matthew C. Fontaine
Julian Togelius
Stefanos Nikolaidis
68
54
0
08 Feb 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
285
4,408
0
27 Oct 2021
Discovering Diverse Nearly Optimal Policies with Successor Features
Tom Zahavy
Brendan O'Donoghue
André Barreto
Volodymyr Mnih
Sebastian Flennerhag
Satinder Singh
65
21
0
01 Jun 2021
Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization
Zhen-Yu Tang
Chao Yu
Boyuan Chen
Huazhe Xu
Xiaolong Wang
Fei Fang
S. Du
Yu Wang
Yi Wu
71
53
0
08 Mar 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
Basel Alomair
Jacob Steinhardt
ReLM
FaML
173
2,265
0
05 Mar 2021
Multiple Plans are Better than One: Diverse Stochastic Planning
Mahsa Ghasemi
Evan Scope Crafts
Bo Zhao
Ufuk Topcu
75
7
0
31 Dec 2020
Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization
Thomas Pierrot
Valentin Macé
Félix Chalumeau
Arthur Flajolet
Geoffrey Cideron
Karim Beguir
Antoine Cully
Olivier Sigaud
Nicolas Perrin-Gilbert
55
62
0
15 Jun 2020
Non-local Policy Optimization via Diversity-regularized Collaborative Exploration
Zhenghao Peng
Hao Sun
Bolei Zhou
52
19
0
14 Jun 2020
Effective Diversity in Population Based Reinforcement Learning
Jack Parker-Holder
Aldo Pacchiano
K. Choromanski
Stephen J. Roberts
101
162
0
03 Feb 2020
Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies
M. A. Masood
Finale Doshi-Velez
42
51
0
31 May 2019
Learning Novel Policies For Tasks
Yunbo Zhang
Wenhao Yu
Greg Turk
41
34
0
13 May 2019
Understanding the impact of entropy on policy optimization
Zafarali Ahmed
Nicolas Le Roux
Mohammad Norouzi
Dale Schuurmans
73
233
0
27 Nov 2018
Diversity is All You Need: Learning Skills without a Reward Function
Benjamin Eysenbach
Abhishek Gupta
Julian Ibarz
Sergey Levine
99
1,085
0
16 Feb 2018
Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
Zhang-Wei Hong
Tzu-Yun Shann
Shih-Yang Su
Yi-Hsiang Chang
Chun-Yi Lee
59
124
0
13 Feb 2018
Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents
Edoardo Conti
Vashisht Madhavan
F. Such
Joel Lehman
Kenneth O. Stanley
Jeff Clune
63
347
0
18 Dec 2017