Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

17 June 2025
Ling Team
Bin Hu
Cai Chen
Deng Zhao
Ding Liu
Dingnan Jin
Feng Zhu
Hao Dai
Hongzhi Luan
Jia Guo
Jiaming Liu
Jiewei Wu
Jun Mei
Jun Zhou
Junbo Zhao
Junwu Xiong
Kaihong Zhang
Kuan Xu
Lei Liang
Liang Jiang
Liangcheng Fu
Longfei Zheng
Qiang Gao
Qing Cui
Quan Wan
Shaomian Zheng
Shuaicheng Li
Tongkai Yang
Wang Ren
Xiaodong Yan
Xiaopei Wan
Xiaoyun Feng
Xin Zhao
Xinxing Yang
Xinyu Kong
Xuemin Yang
Yang Li
Yingting Wu
Yongkang Liu
Zhankai Xu
Zhenduo Zhang
Zhenglei Zhou
Zhenyu Huang
Zhiqiang Zhang
Zihao Wang
Zujie Wen
    OffRL · MoE · ALM · LRM
arXiv (abs) · PDF · HTML
Main: 18 pages · 2 figures · Bibliography: 3 pages · 4 tables
Abstract

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8-billion-parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training and propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput through algorithm-system co-design. Second, we empirically demonstrate that selecting distillation checkpoints for subsequent RL training based on entropy loss, rather than on validation metrics, yields superior performance-efficiency trade-offs. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise when training on mixed datasets. We will release the model, dataset, and code.
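
To make the checkpoint-selection idea concrete, the short Python sketch below picks the distillation checkpoint used to initialize RL by its entropy loss rather than by validation metrics. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint names, the entropy-loss values, and the "lowest entropy loss wins" rule are all hypothetical.

# Minimal sketch (assumption, not the authors' released code): select the
# distillation checkpoint that starts the RL stage by entropy loss instead of
# validation metrics. The Checkpoint fields, the "pick the lowest entropy loss"
# rule, and the example numbers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    path: str
    val_accuracy: float  # conventional selection criterion, deliberately unused here
    entropy_loss: float  # mean policy entropy measured on a held-out prompt set

def select_rl_start_checkpoint(checkpoints: list[Checkpoint]) -> Checkpoint:
    # Pick the candidate with the lowest entropy loss; the abstract only states
    # that entropy loss is the selection signal, so this concrete rule is an assumption.
    return min(checkpoints, key=lambda c: c.entropy_loss)

if __name__ == "__main__":
    candidates = [
        Checkpoint("ckpt_step_4000", val_accuracy=0.71, entropy_loss=0.42),
        Checkpoint("ckpt_step_6000", val_accuracy=0.74, entropy_loss=0.31),
        Checkpoint("ckpt_step_8000", val_accuracy=0.73, entropy_loss=0.37),
    ]
    chosen = select_rl_start_checkpoint(candidates)
    print(f"Start RL from {chosen.path} (entropy loss {chosen.entropy_loss:.2f})")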

@article{team2025_2506.14731,
  title={Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs},
  author={Ling Team and Bin Hu and Cai Chen and Deng Zhao and Ding Liu and Dingnan Jin and Feng Zhu and Hao Dai and Hongzhi Luan and Jia Guo and Jiaming Liu and Jiewei Wu and Jun Mei and Jun Zhou and Junbo Zhao and Junwu Xiong and Kaihong Zhang and Kuan Xu and Lei Liang and Liang Jiang and Liangcheng Fu and Longfei Zheng and Qiang Gao and Qing Cui and Quan Wan and Shaomian Zheng and Shuaicheng Li and Tongkai Yang and Wang Ren and Xiaodong Yan and Xiaopei Wan and Xiaoyun Feng and Xin Zhao and Xinxing Yang and Xinyu Kong and Xuemin Yang and Yang Li and Yingting Wu and Yongkang Liu and Zhankai Xu and Zhenduo Zhang and Zhenglei Zhou and Zhenyu Huang and Zhiqiang Zhang and Zihao Wang and Zujie Wen},
  journal={arXiv preprint arXiv:2506.14731},
  year={2025}
}