Risk-sensitive Reinforcement Learning via Distortion Risk Measures

Abstract

We address the problem of control in a risk-sensitive reinforcement learning (RL) context via distortion risk measures (DRM). We propose policy gradient algorithms that maximize the DRM of the cumulative reward in an episodic Markov decision process, in both on-policy and off-policy RL settings. We employ two different approaches to devise the policy gradient algorithms. In the first approach, we derive a variant of the policy gradient theorem that caters to the DRM objective, and use this theorem in conjunction with a likelihood ratio-based gradient estimation scheme. In the second approach, we estimate the DRM from the empirical distribution of cumulative rewards, and combine this estimation scheme with a smoothed functional-based gradient estimation scheme. For policy gradient algorithms using either approach, we derive non-asymptotic bounds that establish convergence to an approximate stationary point of the DRM objective.
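
For background, the DRM of a random variable $X$ with CDF $F_X$, under a distortion function $g:[0,1]\to[0,1]$ that is non-decreasing with $g(0)=0$ and $g(1)=1$, is commonly defined as $\rho_g(X) = \int_0^\infty g(1 - F_X(x))\,dx - \int_{-\infty}^0 \left[1 - g(1 - F_X(x))\right]dx$. The following is a minimal illustrative sketch of the second approach only, not the paper's algorithm: it combines a standard L-statistic estimator of the DRM from sampled returns with a two-sided Gaussian smoothed-functional gradient estimate. The rollout routine collect_returns(theta, n), the distortion function, and all names here are hypothetical placeholders.

import numpy as np

def empirical_drm(returns, g):
    # L-statistic estimator of the DRM: weight the i-th order statistic
    # by the increment of g over the empirical survival probabilities.
    x = np.sort(np.asarray(returns))
    n = len(x)
    upper = g((n - np.arange(n)) / n)      # g at survival prob. just below x_(i)
    lower = g((n - np.arange(n) - 1) / n)  # g at survival prob. at/above x_(i)
    return np.sum(x * (upper - lower))

def sf_gradient(theta, g, collect_returns, n_samples=500, delta=0.05, rng=None):
    # Two-sided smoothed-functional (Gaussian perturbation) estimate of the
    # gradient of the DRM objective with respect to the policy parameters.
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)   # random perturbation direction
    rho_plus = empirical_drm(collect_returns(theta + delta * u, n_samples), g)
    rho_minus = empirical_drm(collect_returns(theta - delta * u, n_samples), g)
    return u * (rho_plus - rho_minus) / (2.0 * delta)

# Example distortion (hypothetical choice): g(p) = min(p / alpha, 1),
# which places all weight on the top alpha-fraction of sampled returns.
cvar_distortion = lambda p, alpha=0.1: np.minimum(p / alpha, 1.0)

A gradient ascent iteration would then update theta via theta + step_size * sf_gradient(theta, cvar_distortion, collect_returns). With the identity distortion g(p) = p, empirical_drm reduces to the sample mean, recovering the risk-neutral objective.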
