Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

Chen Yang
Guangyue Peng
Jiaying Zhu
Ran Le
Ruixiang Feng
Tao Zhang
Wei Ruan
Xiaoqi Liu
Xiaoxue Cheng
Xiyun Xu
Yang Song
Yanzipeng Gao
Yiming Jia
Yun Xing
Yuntao Wen
Zekai Wang
Zhenwei An
Zhicong Sun
Zongchao Chen
11 pages (main), 5 figures, 5 tables, 4 pages bibliography, 1 page appendix
Abstract

We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and fine-tuned on over 30 million diverse instructions, Nanbeige4-3B extends the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement with chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which yields further performance gains. Finally, we apply a multi-stage reinforcement learning phase that leverages verifiable rewards and preference modeling to strengthen both reasoning ability and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at this https URL.
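To make the Warmup-Stable-Decay idea concrete, the following is a minimal Python sketch of a WSD learning-rate schedule paired with stage-wise data-mixture weights. All function names, stage boundaries, learning-rate values, and mixture proportions are illustrative assumptions; the paper's actual FG-WSD stages and mixtures are not specified in this abstract.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) schedule with stage-wise
# data mixtures. Values are hypothetical and for illustration only.

def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_frac=0.01, decay_frac=0.10):
    """Piecewise schedule: linear warmup -> constant plateau -> linear decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                               # warmup phase
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:                                # stable phase
        return peak_lr
    frac = (step - decay_start) / max(1, decay_steps)     # decay phase
    return peak_lr - (peak_lr - min_lr) * frac

# Hypothetical stage-wise mixture refinement: later stages up-weight
# higher-quality or more specialized data (weights are made up).
STAGE_MIXTURES = [
    (0.0, {"web": 0.70, "code": 0.15, "math": 0.05, "curated": 0.10}),
    (0.6, {"web": 0.45, "code": 0.25, "math": 0.10, "curated": 0.20}),
    (0.9, {"web": 0.25, "code": 0.30, "math": 0.15, "curated": 0.30}),
]

def mixture_for_step(step, total_steps):
    """Return the data-mixture weights for the training stage containing `step`."""
    progress = step / total_steps
    weights = STAGE_MIXTURES[0][1]
    for start, mix in STAGE_MIXTURES:
        if progress >= start:
            weights = mix
    return weights
```

The sketch only shows the generic structure (warmup, stable plateau, decay, and mixture changes at stage boundaries); the "fine-grained" aspect of FG-WSD, i.e., how the stages and mixtures are chosen, is the paper's contribution and is not reproduced here.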
