Variance-reduced Q-learning is minimax optimal
Abstract
We introduce and analyze a form of variance-reduced Q-learning. For γ-discounted MDPs with finite state space X and action space U, we prove that it yields an ε-accurate estimate of the optimal Q-function in the ℓ∞-norm using Õ(D / ((1−γ)³ ε²)) samples, where D = |X|·|U|. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity 1/(1−γ), and it is the first form of model-free Q-learning proven to achieve the worst-case optimal cubic scaling in the discount complexity parameter together with optimal linear scaling in the state and action space sizes. By contrast, our past work shows that ordinary Q-learning has worst-case quartic scaling in the discount complexity.
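To make the idea concrete, below is a minimal sketch of an SVRG-style variance-reduced Q-learning loop under a generative model: each epoch recenters the stochastic Bellman update around a frozen reference point using a large Monte Carlo batch, then runs cheap one-sample recentered updates. The toy MDP, epoch lengths, batch sizes, and the rescaled linear step size are illustrative assumptions, not the paper's exact settings or constants.

```python
import numpy as np

# Illustrative toy MDP (assumption: dense random transitions, rewards in [0, 1]).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] = next-state distribution
R = rng.uniform(0, 1, size=(S, A))           # deterministic reward table

def bellman(Q):
    # Exact Bellman optimality operator (used only for error checking).
    return R + gamma * P @ Q.max(axis=1)

def sample_next(n):
    # Generative model: n i.i.d. next-state samples for every (s, a) pair.
    return np.array([[rng.choice(S, size=n, p=P[s, a]) for a in range(A)]
                     for s in range(S)])     # shape (S, A, n)

def empirical_bellman(Q, ns):
    # Monte Carlo Bellman operator built from sampled next states ns.
    return R + gamma * Q.max(axis=1)[ns].mean(axis=2)

Q = np.zeros((S, A))
for epoch in range(6):
    Qbar = Q.copy()
    # Recentering: high-accuracy Monte Carlo estimate of T(Qbar) once per epoch.
    T_bar = empirical_bellman(Qbar, sample_next(2000))
    for k in range(1, 501):
        ns = sample_next(1)                  # one fresh sample per (s, a)
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)  # rescaled linear step size
        # Variance-reduced update: the single-sample terms at Q and Qbar are
        # correlated, so their difference has small variance near convergence.
        Q = (1 - lam) * Q + lam * (empirical_bellman(Q, ns)
                                   - empirical_bellman(Qbar, ns) + T_bar)

Qstar_hat = Q  # approximate optimal Q-function
```

The recentering term is what buys the improved sample complexity: the one-sample noise of the plain update is replaced by the noise of a difference that shrinks as Q approaches the reference point, mirroring the variance-reduction mechanism the abstract credits for the cubic (rather than quartic) discount-complexity scaling.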
