Policy Optimization with Stochastic Mirror Descent

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
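To make the idea concrete, below is a minimal sketch (not the paper's exact VRMPO) of a mirror-descent policy update combined with a SARAH/SVRG-style variance-reduced gradient estimator. The environment interface (`sample_trajs`), the toy Gaussian-bandit policy, the step size `eta`, and the batch sizes are illustrative assumptions; importance weighting for off-policy correction is omitted for brevity.

```python
import numpy as np

def policy_gradient(theta, trajectories):
    """REINFORCE-style gradient estimate from a batch of trajectories.
    Each trajectory is a list of (grad-log-prob, return) pairs (placeholder)."""
    grads = [sum(glp * ret for glp, ret in traj) for traj in trajectories]
    return np.mean(grads, axis=0)

def mirror_step(theta, v, eta):
    """Mirror-descent step; with the squared-Euclidean Bregman divergence this
    reduces to a plain gradient ascent update."""
    return theta + eta * v

def vrmpo_like_update(theta_ref, sample_trajs, eta, inner_steps):
    """One outer epoch: a large-batch reference gradient followed by
    small-batch, recursively variance-reduced inner updates."""
    v = policy_gradient(theta_ref, sample_trajs(theta_ref, n=100))
    theta, theta_prev = mirror_step(theta_ref, v, eta), theta_ref
    for _ in range(inner_steps):
        batch = sample_trajs(theta, n=10)           # small minibatch per inner step
        g_new = policy_gradient(theta, batch)
        g_old = policy_gradient(theta_prev, batch)  # same batch, previous parameters
                                                    # (importance weights omitted here)
        v = g_new - g_old + v                       # recursive variance-reduced estimator
        theta_prev, theta = theta, mirror_step(theta, v, eta)
    return theta

# Toy usage: 2-parameter Gaussian policy on a one-step bandit (illustrative only).
rng = np.random.default_rng(0)

def sample_trajs(theta, n):
    trajs = []
    for _ in range(n):
        a = theta[0] + np.exp(theta[1]) * rng.standard_normal()
        ret = -(a - 1.0) ** 2                       # reward peaks at action a = 1
        glp = np.array([(a - theta[0]) / np.exp(2 * theta[1]),
                        ((a - theta[0]) ** 2) / np.exp(2 * theta[1]) - 1.0])
        trajs.append([(glp, ret)])
    return trajs

theta = vrmpo_like_update(np.zeros(2), sample_trajs, eta=0.05, inner_steps=20)
```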
