What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

3 March 2025
Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, Lin Yan
Abstract

Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks such as math and reasoning. However, Proximal Policy Optimization (PPO), though effective in many RL scenarios, fails in long-CoT tasks. This paper identifies value initialization bias and reward signal decay as the root causes of PPO's failure. We propose Value-Calibrated PPO (VC-PPO) to address these issues. In VC-PPO, the value model is pretrained to tackle initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance. Ablation studies show that the techniques in VC-PPO are essential to enhancing PPO for long-CoT tasks.
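The decoupled GAE computation is the abstract's central mechanism, so a minimal sketch may help make it concrete. The code below is an illustration under stated assumptions, not the paper's implementation: rewards and value estimates are given per token for a single trajectory, gamma is 1, and the lambda settings (lam_actor, lam_critic) are placeholder values rather than the paper's reported configuration.

import numpy as np

def gae(rewards, values, gamma=1.0, lam=1.0):
    # Standard Generalized Advantage Estimation over one trajectory.
    # rewards: shape (T,); values: shape (T + 1,), last entry is the bootstrap value.
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages

def decoupled_gae(rewards, values, gamma=1.0, lam_actor=1.0, lam_critic=0.95):
    # Decoupled GAE: one lambda for the policy's advantages, another for the critic's targets.
    # The lambda values here are illustrative assumptions.
    adv_actor = gae(rewards, values, gamma, lam_actor)    # feeds the PPO policy loss
    adv_critic = gae(rewards, values, gamma, lam_critic)  # builds TD(lambda)-style returns
    value_targets = adv_critic + values[:-1]              # regression targets for the value model
    return adv_actor, value_targets

On this reading, decoupling matters because with lambda below 1 a terminal reward's contribution to an early token's advantage decays roughly as lambda raised to the remaining sequence length; over the thousands of tokens in a long CoT that signal all but vanishes, so the actor can use a lambda near 1 while the critic keeps a smaller lambda for lower-variance targets.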

@article{yuan2025_2503.01491,
  title={What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret},
  author={Yufeng Yuan and Yu Yue and Ruofei Zhu and Tiantian Fan and Lin Yan},
  journal={arXiv preprint arXiv:2503.01491},
  year={2025}
}