2410.01679
VinePPO: Refining Credit Assignment in RL Training of LLMs
2 October 2024
Amirhossein Kazemnejad
Milad Aghajohari
Eva Portelance
Alessandro Sordoni
Siva Reddy
Rameswar Panda
Nicolas Le Roux
OffRL
LRM
Papers citing
"VinePPO: Refining Credit Assignment in RL Training of LLMs"
11 / 11 papers shown
Title
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Yiping Wang
Qing Yang
Zhiyuan Zeng
Liliang Ren
Liu Liu
...
Jianfeng Gao
Weizhu Chen
Shuaiqiang Wang
Simon Shaolei Du
Yelong Shen
OffRL
ReLM
LRM
299
47
0
29 Apr 2025
Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
Taiwei Shi
Yiyang Wu
Linxin Song
Dinesh Manocha
Jieyu Zhao
LRM
153
15
0
07 Apr 2025
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu
Zheng Zhang
Ruofei Zhu
Yufeng Yuan
Xiaochen Zuo
...
Ya Zhang
Lin Yan
Mu Qiao
Yonghui Wu
Mingxuan Wang
OffRL
LRM
206
218
0
18 Mar 2025
T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
Zhenyu Hou
Xin Lv
Rui Lu
Jing Zhang
Yongqian Li
Zijun Yao
Juanzi Li
J. Tang
Yuxiao Dong
OffRL
LRM
ReLM
142
33
0
20 Jan 2025
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
Rishabh Agarwal
Rameswar Panda
OffRL
146
11
0
23 Oct 2024
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Shusheng Xu
Wei Fu
Jiaxuan Gao
Wenjie Ye
Weiling Liu
Zhiyu Mei
Guangju Wang
Chao Yu
Yi Wu
147
165
0
16 Apr 2024
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou
Frank F. Xu
Hao Zhu
Xuhui Zhou
Robert Lo
...
Tianyue Ou
Yonatan Bisk
Daniel Fried
Uri Alon
Graham Neubig
LLMAG
178
494
0
25 Jul 2023
Reasoning with Language Model is Planning with World Model
Shibo Hao
Yi Gu
Haodi Ma
Joshua Jiahua Hong
Zhen Wang
D. Wang
Zhiting Hu
ReLM
LRM
LLMAG
156
600
0
24 May 2023
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
485
1,768
0
18 Sep 2019
Proximal Policy Optimization Algorithms
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
OffRL
547
19,296
0
20 Jul 2017
Thinking Fast and Slow with Deep Learning and Tree Search
Thomas W. Anthony
Zheng Tian
David Barber
130
402
0
23 May 2017