
Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Main: 9 pages
11 figures
Bibliography: 3 pages
1 table
Appendix: 3 pages
Abstract

Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving of meta-reasoning (which selects an appropriate sub-problem from multiple candidates) and solving (which addresses the selected sub-problem). This implies that authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into flat token-by-token prediction over the teacher's reasoning path, preventing the structure from being effectively distilled to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into sequences of meta-reasoning-solving steps and computes rewards that measure the structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show that RLKD surpasses standard SFT-RL pipelines even when trained on only 0.1% of the data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.
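The abstract describes GSRM as decomposing a reasoning path into ordered meta-reasoning-solving steps and rewarding the student for staying structurally aligned with the teacher. The sketch below is a minimal, hypothetical Python illustration of that idea; the paragraph-based step extraction, token-overlap similarity test, and prefix-alignment scoring are placeholder assumptions for illustration, not the paper's actual GSRM (which is itself a generative model).

```python
# Hypothetical sketch of a GSRM-style structural reward (not the authors' code).
# Idea: decompose each reasoning path into ordered (meta-reasoning, solving)
# steps, then score how far the student's step sequence stays aligned with the
# teacher's before diverging.
from dataclasses import dataclass


@dataclass
class Step:
    meta: str      # which sub-problem was selected at this step
    solving: str   # how that sub-problem was addressed


def extract_steps(reasoning_path: str) -> list[Step]:
    """Toy decomposition: treat each paragraph as one step, using its first
    sentence as the meta-reasoning part. A real GSRM would generate these
    pairs with an LLM rather than a heuristic split."""
    steps = []
    for block in reasoning_path.strip().split("\n\n"):
        first, _, rest = block.partition(". ")
        steps.append(Step(meta=first.strip(), solving=(rest or first).strip()))
    return steps


def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Placeholder similarity: token Jaccard overlap. A real system would use
    embeddings or a reward-model judgment instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb)) > threshold


def structure_reward(student_path: str, teacher_path: str) -> float:
    """Reward = fraction of teacher steps matched, in order, before the first
    structural divergence (a prefix-alignment style score in [0, 1])."""
    s_steps, t_steps = extract_steps(student_path), extract_steps(teacher_path)
    matched = 0
    for s, t in zip(s_steps, t_steps):
        if similar(s.meta, t.meta) and similar(s.solving, t.solving):
            matched += 1
        else:
            break
    return matched / max(1, len(t_steps))
```

In an RL distillation loop of the kind the abstract describes, a scalar like `structure_reward(student_rollout, teacher_trace)` would be combined with the policy-gradient objective so the student is rewarded for reproducing the teacher's step structure rather than its exact tokens.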

@article{xu2025_2505.16142,
  title={Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning},
  author={Shicheng Xu and Liang Pang and Yunchang Zhu and Jia Gu and Zihao Wei and Jingcheng Deng and Feiyang Pan and Huawei Shen and Xueqi Cheng},
  journal={arXiv preprint arXiv:2505.16142},
  year={2025}
}