Test-Time Scaling with Reflective Generative Model

2 July 2025

Zixiao Wang

Yuxin Wang

Xiaorui Wang

Mengting Xing

Jie Gao

Jianjun Xu

Guangcan Liu

Chenhui Jin

Zhuo Wang

Shengzhuo Zhang

Hongtao Xie

LRM


Main: 14 pages

8 figures

Bibliography: 1 page

4 tables

Appendix: 1 page

Abstract

We introduce our first reflective generative model, MetaStone-S1, which attains OpenAI o3's performance via the self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring, respectively, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning-effort modes (low, medium, and high) based on a controllable thinking length. Moreover, we empirically establish a scaling law that relates total thinking computation to TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at this https URL.
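The selection step implied by the abstract can be illustrated with a minimal sketch: sample several candidate reasoning trajectories, score each step with the process-scoring head, and keep the trajectory with the best aggregate score. Everything named here is an assumption made for illustration, not the authors' implementation: the geometric-mean aggregation, the `EFFORT_TO_K` mapping from effort mode to candidate count, and the function names are hypothetical, and the per-step scores are supplied as plain numbers rather than produced by a real scoring head.

```python
import math
from typing import List, Sequence

# Hypothetical mapping from reasoning-effort mode to the number of sampled
# candidate trajectories; these values are illustrative assumptions.
EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step process scores in (0, 1] into one trajectory score.

    Uses a geometric mean (an assumption), so a single weak step lowers the
    score of the whole trajectory.
    """
    if not step_scores:
        return 0.0
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def best_of_k(candidates: List[List[float]]) -> int:
    """Return the index of the candidate with the highest trajectory score."""
    return max(range(len(candidates)), key=lambda i: trajectory_score(candidates[i]))


if __name__ == "__main__":
    # Stand-in step scores for three sampled trajectories; in a real system
    # these would come from the scoring head that shares the policy backbone.
    sampled = [[0.9, 0.1], [0.6, 0.6], [0.2, 0.2]]
    print("chosen candidate:", best_of_k(sampled))
```

In this sketch a higher effort mode simply means more sampled candidates to choose from, which is one way a controllable thinking budget could trade compute for accuracy.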


@article{wang2025_2507.01951,
  title={Test-Time Scaling with Reflective Generative Model},
  author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
  journal={arXiv preprint arXiv:2507.01951},
  year={2025}
}
