A universal policy wrapper with guarantees

We introduce a universal policy wrapper for reinforcement learning agents that ensures formal goal-reaching guarantees. In contrast to standard reinforcement learning algorithms that excel in performance but lack rigorous safety assurances, our wrapper selectively switches between a high-performing base policy -- derived from any existing RL method -- and a fallback policy with known convergence properties. The base policy's value function supervises this switching process, determining when the fallback policy should override the base policy to ensure the system remains on a stable path. Our analysis proves that the wrapper inherits the fallback policy's goal-reaching guarantees while preserving or improving upon the performance of the base policy. Notably, it operates without additional system knowledge or online constrained optimization, making it readily deployable across diverse reinforcement learning architectures and tasks.
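To make the switching idea concrete, below is a minimal Python sketch of one plausible reading of the abstract. The names (base_policy, fallback_policy, value_fn) and the monotone-value switching criterion are illustrative assumptions, not the paper's exact construction.

```python
class PolicyWrapper:
    """Sketch of a value-supervised switching wrapper (illustrative only).

    The base policy is followed as long as its value function certifies
    progress; otherwise the fallback policy, which has known goal-reaching
    guarantees, takes over. The criterion below is a hypothetical example.
    """

    def __init__(self, base_policy, fallback_policy, value_fn, tolerance=0.0):
        self.base = base_policy          # high-performing RL policy (any method)
        self.fallback = fallback_policy  # policy with known convergence properties
        self.value_fn = value_fn         # base policy's value function V(s)
        self.tol = tolerance             # allowed dip in value before overriding
        self.best_value = float("-inf")  # highest value certified so far

    def act(self, state):
        v = self.value_fn(state)
        if v >= self.best_value - self.tol:
            # Value has not degraded: keep the base policy and record progress.
            self.best_value = max(self.best_value, v)
            return self.base(state)
        # Value dropped below the certified level: the fallback policy
        # overrides the base policy to keep the system on a stable path.
        return self.fallback(state)
```

At each step the wrapper queries only the base policy's value function, so no extra system model or online constrained optimization is required to decide which policy acts.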
@article{bolychev2025_2505.12354,
  title={A universal policy wrapper with guarantees},
  author={Anton Bolychev and Georgiy Malaniya and Grigory Yaremenko and Anastasia Krasnaya and Pavel Osinenko},
  journal={arXiv preprint arXiv:2505.12354},
  year={2025}
}