Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models

5 June 2025
Fei Ding
Baiqiao Wang
Zijian Zeng
Youwei Wang
Abstract

The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate supervision in GRPO frequently leads to inefficient exploration dynamics. A single error in a complex reasoning chain can invalidate the entire solution, resulting in abrupt reward vanishing and compromising training. To address these challenges, we propose MGRPO (Multi-Layer GRPO). MGRPO operates in two layers: the first layer employs standard GRPO to generate an initial response. This response, along with the original query, is then fed into a second-layer GRPO process, which is specifically trained to identify and correct errors in the initial response, effectively creating a self-correction loop. This mechanism provides implicit process-level supervision by rewarding successful error correction, without requiring an explicit, densely annotated reward model. Experimental results on several mathematical reasoning benchmarks demonstrate that MGRPO significantly outperforms standard GRPO, achieving superior performance by fostering both reasoning and self-correction abilities.
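
The abstract describes the two-layer procedure only at a high level. The Python sketch below illustrates one plausible reading of that loop, not the authors' implementation: the policy object, the generate, grpo_update, and is_correct helpers, and the wording of the correction prompt are all hypothetical placeholders, and the reward here is a simple correctness check standing in for whatever outcome reward the paper uses.

# Minimal sketch of a two-layer MGRPO-style training step, under the
# assumptions stated above. Layer 1 runs standard GRPO on the original
# query; layer 2 feeds the query plus each first-layer response back to
# the model and rewards it for producing a corrected solution, which
# acts as implicit process-level supervision.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    response: str
    reward: float


def mgrpo_step(
    policy,                                                 # current LLM policy (placeholder object)
    generate: Callable[[object, str, int], List[str]],      # sampler: (policy, prompt, n) -> n responses
    grpo_update: Callable[[object, List[Sample]], None],    # one standard GRPO update over a group
    is_correct: Callable[[str, str], bool],                 # outcome check against the reference answer
    query: str,
    reference_answer: str,
    group_size: int = 8,
) -> None:
    # Layer 1: standard GRPO on the original query.
    first_responses = generate(policy, query, group_size)
    layer1 = [
        Sample(query, r, 1.0 if is_correct(r, reference_answer) else 0.0)
        for r in first_responses
    ]
    grpo_update(policy, layer1)

    # Layer 2: self-correction pass. Each first-layer response is appended
    # to the query, and the model is asked to find and fix any errors.
    # For simplicity this sketch rewards a correct final answer; the paper's
    # reward is tied to successful error correction.
    layer2: List[Sample] = []
    for r in first_responses:
        correction_prompt = (
            f"{query}\n\nPrevious attempt:\n{r}\n\n"
            "Identify any errors in the previous attempt and give a corrected solution."
        )
        corrections = generate(policy, correction_prompt, group_size)
        layer2 += [
            Sample(correction_prompt, c,
                   1.0 if is_correct(c, reference_answer) else 0.0)
            for c in corrections
        ]
    grpo_update(policy, layer2)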

@article{ding2025_2506.04746,
  title={Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models},
  author={Fei Ding and Baiqiao Wang and Zijian Zeng and Youwei Wang},
  journal={arXiv preprint arXiv:2506.04746},
  year={2025}
}