Expected Sarsa(λ) with Control Variate for Variance Reduction

Abstract

Off-policy learning is a powerful tool in reinforcement learning, but the high variance of off-policy evaluation is a critical challenge that can drive off-policy learning into uncontrolled instability. In this paper, to reduce this variance, we introduce the control variate technique into Expected Sarsa(λ) and propose a tabular ES(λ)-CV algorithm. We prove that, if a proper estimate of the value function is available, the proposed ES(λ)-CV enjoys a lower variance than Expected Sarsa(λ). Furthermore, to extend ES(λ)-CV to a convergent algorithm with linear function approximation, we propose the GES(λ) algorithm under a convex-concave saddle-point formulation. We prove that the convergence rate of GES(λ) is O(1/T), which matches or outperforms many state-of-the-art gradient-based algorithms, while requiring a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: GQ(λ), GTB(λ), and ABQ(ζ).
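
To illustrate the control variate idea the abstract refers to, the sketch below shows the standard per-decision control-variate form of the off-policy λ-return for action values: instead of reweighting the whole future return by the importance ratio, the recursion bootstraps on the expected action value and adds a zero-mean correction term. This is a minimal illustration of the general technique, not the paper's exact ES(λ)-CV estimator; the function name, array layout, and hyperparameter values are assumptions made for the example.

import numpy as np

def off_policy_lambda_return_cv(rewards, rhos, q_sa, q_bar, gamma=0.99, lam=0.9):
    """Backward recursion for an off-policy lambda-return with a per-decision
    control variate (illustrative sketch, not the paper's exact algorithm).

    Per-step arrays of length T for one episode:
      rewards[t] = R_{t+1}
      rhos[t]    = pi(A_t | S_t) / mu(A_t | S_t)   (importance ratio)
      q_sa[t]    = Q(S_t, A_t)                     (current estimate)
      q_bar[t]   = sum_a pi(a | S_t) Q(S_t, a)     (expected action value)
    Returns G[t], the corrected lambda-return from step t.
    """
    T = len(rewards)
    G = np.zeros(T)
    G[T - 1] = rewards[T - 1]  # terminal step: no bootstrap past episode end
    for t in range(T - 2, -1, -1):
        # Control variate: bootstrap on q_bar (the expected action value under
        # the target policy) and add the zero-mean correction
        # rho * (G - Q) instead of scaling the entire future return by rho.
        # The expectation is unchanged, but the variance shrinks when Q is an
        # accurate estimate of the true action value.
        G[t] = rewards[t] + gamma * (
            q_bar[t + 1] + lam * rhos[t + 1] * (G[t + 1] - q_sa[t + 1])
        )
    return G

In a tabular learner, Q(S_t, A_t) would then be moved toward G[t] with a step size; the variance reduction promised by the control variate term is largest when the current estimates q_sa and q_bar are already close to their true values.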
