Policy Optimization with Stochastic Mirror Descent

Abstract
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes VRMPO, a sample-efficient policy gradient algorithm based on stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive experimental results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
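
The abstract names two ingredients: a recursive variance-reduced policy gradient estimator and a stochastic mirror descent update. The following is a minimal, hypothetical sketch of how those two pieces can fit together on a toy Gaussian-policy bandit; it is not the paper's VRMPO algorithm. All function names, constants, and the toy reward are assumptions made for illustration, the mirror map is the squared-$\ell_2$ norm (so the update reduces to a plain gradient step), and common random numbers stand in for the importance-weighted gradient corrections a full method would use.

```python
import numpy as np

# Illustrative sketch only (hypothetical names and constants), combining a
# recursive variance-reduced gradient estimate with a mirror-descent-style
# policy update on a toy one-dimensional Gaussian-policy bandit.

rng = np.random.default_rng(0)

def grad_estimate(theta, eps):
    """REINFORCE-style gradient of E[r(a)] for a ~ N(theta, 1),
    with reward r(a) = -(a - 3)^2, using shared noise eps so that
    gradients at nearby parameters are correlated."""
    a = theta + eps
    r = -(a - 3.0) ** 2
    # score function of N(theta, 1): d/dtheta log p(a) = (a - theta) = eps
    return np.mean(r * eps)

def mirror_step(theta, grad, lr):
    """Mirror-descent (ascent) step under the squared-l2 mirror map,
    which reduces to an ordinary gradient step; a different Bregman
    divergence would change this update."""
    return theta + lr * grad

n = 16                               # trajectories (here: actions) per batch
alpha, lr = 0.3, 0.05                # mixing weight and step size (hypothetical)
theta_prev = theta = 0.0
d = grad_estimate(theta, rng.standard_normal(n))   # initial estimate

for t in range(200):
    theta_prev, theta = theta, mirror_step(theta, d, lr)
    eps = rng.standard_normal(n)     # shared samples for both evaluations
    g_new = grad_estimate(theta, eps)
    g_old = grad_estimate(theta_prev, eps)
    # recursive variance-reduced estimate: keep the previous estimate,
    # corrected by the gradient difference on the shared samples
    d = g_new + (1.0 - alpha) * (d - g_old)

print(f"estimated optimal mean action: {theta:.2f} (true optimum is 3)")
```

In a full policy-gradient setting the correction term would be computed on whole trajectories with importance weights between the current and previous policies; the shared-noise trick above is only a stand-in that keeps the two-point gradient difference low-variance in this toy example.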