Truncated Proximal Policy Optimization

18 June 2025

Tiantian Fan

Lingjun Liu

Yu Yue

Jiaze Chen

Chengyi Wang

Qiying Yu

Chi Zhang

Zhiqi Lin

Ruofei Zhu

Yufeng Yuan

Xiaochen Zuo

Bole Ma

Mofan Zhang

Gaohong Liu

Ru Zhang

Haotian Zhou

Cong Xie

Ruidong Zhu

Zhi Zhang

Xin Liu

Mingxuan Wang

Lin Yan

Yonghui Wu

OffRL

LRM


Main: 10 pages

3 figures

Bibliography: 1 page

Abstract

Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains of thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, a cost that is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the low hardware utilization inherent in fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE), which estimates advantages from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computations and accelerates the training process without sacrificing convergence performance. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
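The abstract does not give the EGAE formula, so the following is only an illustrative sketch of the general idea of estimating advantages from a response that was cut off before completion. It uses ordinary Generalized Advantage Estimation and, as an assumption, bootstraps from the value model's estimate at the truncation point instead of waiting for the final reward. The function name `truncated_gae` and all of its parameters are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def truncated_gae(
    rewards: List[float],
    values: List[float],
    bootstrap_value: Optional[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Plain GAE over the first T generated tokens of a response.

    rewards[t] and values[t] cover tokens t = 0..T-1.
    bootstrap_value is the value estimate for the state reached after the
    last generated token when generation was truncated, or None when the
    response finished (terminal state, treated as value 0).
    This is an assumed stand-in, not the paper's EGAE definition.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    # Value of the state that follows the last token in the window.
    next_value = 0.0 if bootstrap_value is None else bootstrap_value
    advantages = [0.0] * len(rewards)
    running = 0.0

    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Finished response: the terminal reward is known, no bootstrap needed.
complete = truncated_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], None, lam=1.0)

# Truncated response: no final reward yet, so bootstrap from the value model.
partial = truncated_gae([0.0, 0.0], [0.5, 0.5], bootstrap_value=0.8, lam=1.0)
```

Under this assumption, tokens from a truncated rollout can contribute to a policy update immediately, at the price of relying on the value model's accuracy at the cut-off point.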


@article{fan2025_2506.15050,
  title={Truncated Proximal Policy Optimization},
  author={Tiantian Fan and Lingjun Liu and Yu Yue and Jiaze Chen and Chengyi Wang and Qiying Yu and Chi Zhang and Zhiqi Lin and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and Bole Ma and Mofan Zhang and Gaohong Liu and Ru Zhang and Haotian Zhou and Cong Xie and Ruidong Zhu and Zhi Zhang and Xin Liu and Mingxuan Wang and Lin Yan and Yonghui Wu},
  journal={arXiv preprint arXiv:2506.15050},
  year={2025}
}
