Offline Reinforcement Learning (RL) struggles with distributional shift, which leads to Q-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains Q-function generalization. This over-constraint issue results in poor Q-value estimation and hinders policy improvement. In this paper, we introduce a novel approach to achieve better Q-value estimation by enhancing Q-function generalization in OOD regions within the Convex Hull and its Neighborhood (CHN). Under the safety generalization guarantees of the CHN, we propose the Smooth Bellman Operator (SBO), which updates OOD Q-values by smoothing them with neighboring in-sample Q-values. We theoretically show that SBO approximates the true Q-values for both in-sample and OOD actions within the CHN. Our practical algorithm, Smooth Q-function OOD Generalization (SQOG), empirically alleviates the over-constraint issue, achieving near-accurate Q-value estimation. On the D4RL benchmarks, SQOG outperforms existing state-of-the-art methods in both performance and computational efficiency.
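To make the smoothing idea concrete, below is a minimal, hypothetical PyTorch sketch of how an OOD Q-value could be regressed toward the Q-value of a neighboring in-sample action. It is an illustration of the general concept only, not the paper's actual SBO update: the mixing weight `lam`, the Gaussian neighborhood sampler `noise_std`, and all function and tensor names are assumptions introduced here for clarity.

import torch
import torch.nn.functional as F

def smooth_ood_loss(q_net, target_q_net, obs, dataset_actions,
                    lam=0.5, noise_std=0.1):
    """Hypothetical sketch: pull OOD Q-values toward neighboring in-sample Q-values.

    `q_net(obs, actions)` and `target_q_net(obs, actions)` are assumed to return
    per-sample Q-value tensors; `lam` and `noise_std` are illustrative choices.
    """
    # Sample OOD actions from a small neighborhood around in-sample actions.
    ood_actions = (dataset_actions
                   + noise_std * torch.randn_like(dataset_actions)).clamp(-1.0, 1.0)
    with torch.no_grad():
        # Q-value of the neighboring in-sample action (smoothing anchor)
        # and the current target estimate at the OOD action.
        insample_q = target_q_net(obs, dataset_actions)
        ood_q_old = target_q_net(obs, ood_actions)
        # Smoothed target: blend the OOD estimate with its in-sample neighbor.
        target = lam * ood_q_old + (1.0 - lam) * insample_q
    return F.mse_loss(q_net(obs, ood_actions), target)

In practice such a term would be added to the standard TD loss on dataset transitions, so that Q-values in the neighborhood of the data stay anchored to in-sample estimates rather than being driven by unconstrained extrapolation.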
@article{yao2025_2506.08417,
  title   = {Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood},
  author  = {Qingmao Yao and Zhichao Lei and Tianyuan Chen and Ziyue Yuan and Xuefan Chen and Jianxiang Liu and Faguo Wu and Xiao Zhang},
  journal = {arXiv preprint arXiv:2506.08417},
  year    = {2025}
}