RePO: Replay-Enhanced Policy Optimization

11 June 2025
Siheng Li
Zhanhui Zhou
Wai Lam
Chao Yang
Chaochao Lu
    OffRL
Main: 8 pages, 2 figures, 8 tables; bibliography: 3 pages
Abstract

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at this https URL.
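The abstract does not spell out the algorithmic details, but its core idea, augmenting each prompt's GRPO group with replayed off-policy samples before computing group-relative advantages, can be sketched roughly as below. All names here (ReplayBuffer, repo_step, group_relative_advantages), the uniform-replay strategy, and the omission of any off-policy correction are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict, deque
from statistics import mean, pstdev


class ReplayBuffer:
    """Per-prompt store of past (completion, reward) pairs (hypothetical structure)."""

    def __init__(self, capacity=64):
        self.store = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, prompt, completions, rewards):
        self.store[prompt].extend(zip(completions, rewards))

    def sample(self, prompt, k):
        # Uniform replay; the paper explores several replay strategies.
        pool = list(self.store[prompt])
        return random.sample(pool, min(k, len(pool)))


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mu) / sigma for r in rewards]


def repo_step(prompt, policy_sample, reward_fn, buffer, n_on=8, n_off=8):
    """One RePO-style step: mix fresh on-policy samples with replayed off-policy ones."""
    on_completions = [policy_sample(prompt) for _ in range(n_on)]
    on_rewards = [reward_fn(prompt, c) for c in on_completions]

    replayed = buffer.sample(prompt, n_off)  # off-policy samples from earlier steps
    off_completions = [c for c, _ in replayed]
    off_rewards = [r for _, r in replayed]

    # Advantages are computed over the combined on-policy + off-policy group.
    all_completions = on_completions + off_completions
    advantages = group_relative_advantages(on_rewards + off_rewards)

    buffer.add(prompt, on_completions, on_rewards)  # refresh the buffer with new samples
    return list(zip(all_completions, advantages))
```

The default group sizes of 8 on-policy and 8 off-policy samples mirror the setting reported in the abstract. In practice, replayed samples come from earlier policies, so the policy-gradient loss would typically use an importance-weighted or clipped surrogate objective, which is not shown in this sketch.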

@article{li2025_2506.09340,
  title={RePO: Replay-Enhanced Policy Optimization},
  author={Siheng Li and Zhanhui Zhou and Wai Lam and Chao Yang and Chaochao Lu},
  journal={arXiv preprint arXiv:2506.09340},
  year={2025}
}