
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Zhoujun Cheng
Shibo Hao
Tianyang Liu
Fan Zhou
Yutao Xie
Feng Yao
Yuexin Bian
Yonghao Zhuang
Nilabjo Dey
Yuheng Zha
Yi Gu
Kun Zhou
Yuqi Wang
Yuan Li
Richard Fan
Jianshu She
Chengqian Gao
Abulhair Saparov
Haonan Li
Taylor W. Killian
Mikhail Yurochkin
Zhengzhong Liu
Eric P. Xing
Zhiting Hu
Main: 11 pages · Bibliography: 5 pages · Appendix: 20 pages · 12 figures · 7 tables
Abstract

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains (Math, Code, Science, Logic, Simulation, and Tabular), each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) readily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming the best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, and training and evaluation code to facilitate general-purpose reasoning at: this https URL
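
For context on the Pass@k results mentioned above: Pass@k is commonly reported with the unbiased estimator of Chen et al. (2021). The abstract does not specify the paper's exact evaluation protocol, so the sketch below is only an assumption about how such numbers are typically computed, not the authors' implementation. Given n sampled completions per problem, of which c pass the verifier, it estimates the probability that at least one of k drawn completions is correct.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@k estimator (Chen et al., 2021):
    # probability that at least one of k completions drawn without
    # replacement from n samples (c of them correct) is correct.
    if n - c < k:
        return 1.0  # every possible draw of k contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 16 samples per problem, 3 verified correct.
print(pass_at_k(16, 3, 1))  # 0.1875
print(pass_at_k(16, 3, 8))  # 0.9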

@article{cheng2025_2506.14965,
  title={Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
  author={Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
  journal={arXiv preprint arXiv:2506.14965},
  year={2025}
}