On Convergence of Gradient Expected Sarsa(λ)

We study the convergence of Expected Sarsa(λ) with linear function approximation. We show that applying the off-line estimate (multi-step bootstrapping) to Expected Sarsa(λ) is unstable for off-policy learning. Furthermore, based on the convex-concave saddle-point framework, we propose a convergent Gradient Expected Sarsa(λ) algorithm. Our theoretical analysis shows that the proposed algorithm converges to the optimal solution at a linear convergence rate, which is comparable to that of existing state-of-the-art gradient temporal-difference (GTD) learning algorithms. Furthermore, we develop a Lyapunov-function technique to investigate how the step-size influences the finite-time performance of the algorithm; this technique can potentially be generalized to other GTD algorithms. Finally, we conduct experiments to verify the effectiveness of the proposed algorithm.
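As background for the abstract above, the sketch below illustrates one-step Expected Sarsa with linear function approximation: the bootstrapped target takes an expectation of the next-state action values under the policy, rather than sampling the next action. This is an assumed minimal setup for illustration (the feature table, ε-greedy policy, and all names here are my own), not the paper's Gradient Expected Sarsa(λ) algorithm.

```python
import numpy as np

# Minimal sketch of one-step Expected Sarsa with linear function
# approximation: Q(s, a) = w @ phi(s, a). The random feature table,
# epsilon-greedy policy, and all parameter values are illustrative
# assumptions, not taken from the paper.

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 4
features = rng.normal(size=(n_states, n_actions, d))  # phi(s, a)

def phi(s, a):
    return features[s, a]

def q_values(w, s):
    # Action values for all actions in state s, shape (n_actions,).
    return features[s] @ w

def epsilon_greedy_probs(q, eps=0.1):
    probs = np.full_like(q, eps / len(q))
    probs[np.argmax(q)] += 1.0 - eps
    return probs

def expected_sarsa_update(w, s, a, r, s_next, alpha=0.05, gamma=0.9):
    """Semi-gradient one-step Expected Sarsa TD update."""
    q_next = q_values(w, s_next)
    pi_next = epsilon_greedy_probs(q_next)
    # Bootstrap with the *expected* next action value under the policy.
    target = r + gamma * pi_next @ q_next
    td_error = target - w @ phi(s, a)
    return w + alpha * td_error * phi(s, a)

w = np.zeros(d)
w = expected_sarsa_update(w, s=0, a=1, r=1.0, s_next=2)
```

The paper's point is that this semi-gradient update, combined with multi-step off-policy bootstrapping, need not converge; the proposed algorithm instead recasts the objective as a convex-concave saddle-point problem to obtain a true gradient method with convergence guarantees.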