Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation

29 April 2025

Harry Mead

Clarissa Costen

Bruno Lacerda

Nick Hawes

Abstract

When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem that caps the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines.
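As a rough illustration of why capping can stand in for discarding, the sketch below checks the standard Rockafellar-Uryasev identity on a toy batch of returns: when the cap c is set to the empirical value at risk, c + (1/α)·mean(min(R, c) − c) equals the empirical CVaR. The function names, the toy returns, and the choice α = 0.2 are assumptions made for this example only; the abstract does not give the paper's exact objective or how the cap is estimated during training, so this should not be read as the authors' implementation.

```python
import math


def tail_size(n, alpha):
    """Number of samples in the worst alpha-fraction of n returns (at least 1)."""
    return max(1, math.ceil(alpha * n))


def empirical_var(returns, alpha):
    """Empirical value at risk: the largest return inside the worst alpha-tail."""
    ordered = sorted(returns)
    return ordered[tail_size(len(ordered), alpha) - 1]


def empirical_cvar(returns, alpha):
    """Empirical CVaR: the mean of the worst alpha-fraction of returns."""
    ordered = sorted(returns)
    k = tail_size(len(ordered), alpha)
    return sum(ordered[:k]) / k


def cap_returns(returns, cap):
    """Cap every return at `cap` instead of discarding trajectories above it."""
    return [min(r, cap) for r in returns]


def cvar_from_capped(returns, cap, alpha):
    """Rockafellar-Uryasev form: cap + (1/alpha) * mean(min(R, cap) - cap)."""
    capped = cap_returns(returns, cap)
    mean_shortfall = sum(c - cap for c in capped) / len(capped)
    return cap + mean_shortfall / alpha


# Hypothetical batch of trajectory returns, for illustration only.
returns = [-10.0, -4.0, 0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 8.0, 9.0]
alpha = 0.2

var = empirical_var(returns, alpha)
cvar = empirical_cvar(returns, alpha)
capped = cap_returns(returns, var)
cvar_via_cap = cvar_from_capped(returns, var, alpha)

print("VaR:", var)
print("CVaR from the tail only:", cvar)
print("CVaR from capped returns:", cvar_via_cap)
```

In this toy batch the tail-only estimate uses 2 of the 10 returns, while the capped form is computed over all 10, with every return above the cap replaced by the cap itself. The two estimates agree up to floating-point error only because the cap was set to the empirical VaR; other cap values generally give a different quantity.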


@article{mead2025_2504.20887,
  title={Return Capping: Sample-Efficient CVaR Policy Gradient Optimisation},
  author={Harry Mead and Clarissa Costen and Bruno Lacerda and Nick Hawes},
  journal={arXiv preprint arXiv:2504.20887},
  year={2025}
}
