SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

Xuyang Zhi
Peilun Zhou
Chengqiang Lu
Hang Lv
Yiwei Liang
Rongyang Zhang
Yan Gao
Yi Wu
Yao Hu
Hongchao Gu
Defu Lian
Hao Wang
Enhong Chen
Main: 8 pages · Bibliography: 3 pages · Appendix: 9 pages · 9 figures · 6 tables
Abstract

The evolution of Large Language Models (LLMs) is shifting the focus from single, verifiable tasks toward complex, open-ended real-world scenarios, posing significant challenges for the post-training phase. In these settings, the scale and complexity of reward systems have grown considerably, transitioning toward multi-objective formulations that span a comprehensive spectrum of model capabilities and application contexts. However, traditional methods typically rely on fixed reward weights, ignoring non-stationary learning dynamics and struggling with data heterogeneity across dimensions. To address these issues, we propose SPARD, a framework that establishes an automated, self-paced curriculum: by perceiving learning progress, it dynamically adjusts multi-objective reward weights and data importance, thereby synchronizing learning intent with data utility for optimal performance. Extensive experiments across multiple benchmarks demonstrate that SPARD significantly enhances model capabilities across all domains.
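The abstract does not specify SPARD's exact update rule, but the core idea of progress-aware reward weighting can be illustrated with a minimal sketch. The scheme below is a hypothetical stand-in, not the paper's method: each reward dimension's weight is set by a softmax over its (negated) recent improvement rate, so dimensions that are improving slowly receive more weight in the combined reward.

```python
import math

def update_reward_weights(progress_rates, temperature=1.0):
    """Hypothetical self-paced weighting: softmax over negative
    progress rates, so slowly improving reward dimensions get
    larger weights. The paper's actual rule is not given in the
    abstract; this only illustrates the general idea."""
    logits = [-r / temperature for r in progress_rates]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def combined_reward(rewards, weights):
    """Weighted sum of per-dimension rewards under current weights."""
    return sum(w * r for w, r in zip(weights, rewards))

# Example: dimension 0 is stagnating (low progress), dimension 1
# is improving quickly, so dimension 0 is upweighted.
weights = update_reward_weights([0.05, 0.60])
total = combined_reward([0.3, 0.9], weights)
```

A similar progress signal could, in principle, drive per-sample data-importance weights; the abstract only states that both are adjusted jointly from perceived learning progress.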
