Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang
Jianhao Yan
Yun Luo
Ganqu Cui
Zhi Wang
Xiaoye Qu
Yue Zhang
Yu Cheng
Tao Lin
Main: 7 pages · 17 figures · Bibliography: 3 pages · 9 tables · Appendix: 10 pages
Abstract

Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the ``Shallow Exploration Trap''. To bridge this gap, we propose Length-Incentivized Exploration (\method). This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that \method effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4\% on in-domain tasks and a 2.7\% gain on out-of-domain benchmarks.
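The abstract describes the reward as a length incentive combined with a redundancy penalty, but does not give its exact formulation. The sketch below is a hypothetical illustration of that idea, not the paper's actual reward: it rewards longer reasoning trajectories (length normalized against an assumed cap `max_len`) while subtracting a penalty based on the fraction of repeated n-grams, so that padding the trajectory with repetition does not pay off. All names and constants here are illustrative assumptions.

```python
def length_incentivized_reward(tokens, max_len=4096, ngram=4):
    """Illustrative sketch of a length reward with a redundancy penalty.

    This is NOT the paper's formulation (the abstract does not specify one);
    `max_len` and `ngram` are hypothetical parameters chosen for illustration.
    """
    # Length term: longer trajectories score higher, capped and
    # normalized to [0, 1] so the reward stays bounded.
    length_reward = min(len(tokens), max_len) / max_len

    # Redundancy term: fraction of repeated n-grams in the trajectory.
    # A trajectory that loops over the same phrases has few unique
    # n-grams relative to its total count, so the penalty grows.
    ngrams = [tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    redundancy = 1.0 - len(set(ngrams)) / len(ngrams) if ngrams else 0.0

    return length_reward - redundancy


# A repetitive trajectory is penalized relative to a diverse one of
# equal length, so length alone cannot be gamed by repetition.
diverse = [f"tok{i}" for i in range(16)]
repetitive = ["a", "b"] * 8  # same length, almost all n-grams repeat
print(length_incentivized_reward(diverse) > length_incentivized_reward(repetitive))
```

Under this kind of shaping, the length term pushes the policy toward longer explorations (countering the exponential decay in sampling long sequences), while the redundancy term ensures the extra length must cover new states rather than repeat old ones.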
