We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $\mathcal{S}$ and action space $\mathcal{A}$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\widetilde{\mathcal{O}}\big(\tfrac{D}{(1-\gamma)^3 \epsilon^2}\big)$ samples, where $D = |\mathcal{S}|\,|\mathcal{A}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity $\tfrac{1}{1-\gamma}$. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.
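As a concrete illustration of the variance-reduction idea (a minimal sketch under assumed parameters, not the paper's exact algorithm), each stochastic Bellman update can be recentered around a reference point $\bar{Q}$ whose Bellman image is estimated once per epoch with many samples, so that the per-step noise largely cancels. The toy MDP, step sizes, and sample counts below are illustrative assumptions:

```python
import numpy as np

# SVRG-style variance-reduced Q-learning sketch on a randomly generated
# 2-state, 2-action MDP under a synchronous generative model.
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = next-state distribution
R = rng.random((nS, nA))                       # deterministic rewards in [0, 1]

def sample_next(s, a):
    return rng.choice(nS, p=P[s, a])

def monte_carlo_bellman(Q, n):
    """n-sample Monte Carlo estimate of the Bellman operator (T Q)(s, a)."""
    T = np.zeros((nS, nA))
    for s in range(nS):
        for a in range(nA):
            draws = [Q[sample_next(s, a)].max() for _ in range(n)]
            T[s, a] = R[s, a] + gamma * np.mean(draws)
    return T

def recentered_step(Q, Qbar, Tbar, eta):
    """One variance-reduced update: the same sampled next state is used for
    both Q and the reference Qbar, so their sampling noise largely cancels."""
    new = Q.copy()
    for s in range(nS):
        for a in range(nA):
            sp = sample_next(s, a)
            t_q = R[s, a] + gamma * Q[sp].max()
            t_qbar = R[s, a] + gamma * Qbar[sp].max()
            new[s, a] = Q[s, a] + eta * (t_q - t_qbar + Tbar[s, a] - Q[s, a])
    return new

Q = np.zeros((nS, nA))
for epoch in range(30):
    Qbar = Q.copy()
    Tbar = monte_carlo_bellman(Qbar, n=2000)  # accurate recentering estimate
    for k in range(1, 201):
        Q = recentered_step(Q, Qbar, Tbar, eta=1.0 / k)

# Exact Q* via value iteration, for comparison.
Qstar = np.zeros((nS, nA))
for _ in range(2000):
    Qstar = R + gamma * P @ Qstar.max(axis=1)
print(float(np.max(np.abs(Q - Qstar))))  # ell_infinity error
```

The key design choice is the coupled sample in `recentered_step`: because the increment is a difference of Bellman backups at a shared next state plus the precomputed `Tbar`, its variance scales with the distance between `Q` and `Qbar` rather than with the raw magnitude of `Q`.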