Variance-reduced Q-learning is minimax optimal
Abstract
We introduce and analyze a form of variance-reduced Q-learning. For γ-discounted MDPs with finite state space X and action space U, we prove that it yields an ε-accurate estimate of the optimal Q-function in the ℓ∞-norm using Õ(D / ((1−γ)³ ε²)) samples, where D = |X|·|U|. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity 1/(1−γ), and it is the first form of model-free Q-learning proven to achieve the worst-case optimal cubic scaling in the discount complexity parameter together with optimal linear scaling in the state and action space sizes. By contrast, our past work shows that ordinary Q-learning has worst-case quartic scaling in the discount complexity.
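To make the idea concrete, below is a minimal sketch of an SVRG-style variance-reduced Q-learning loop under a generative model: each epoch recenters the stochastic Bellman update around a frozen reference point using a large Monte Carlo batch, then runs cheap one-sample recentered updates. The toy MDP, epoch lengths, batch sizes, and the rescaled linear step size are illustrative assumptions, not the paper's exact settings or constants.

```python
import numpy as np

# Illustrative toy MDP (assumption: dense random transitions, rewards in [0, 1]).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] = next-state distribution
R = rng.uniform(0, 1, size=(S, A))           # deterministic reward table

def bellman(Q):
    # Exact Bellman optimality operator (used only for error checking).
    return R + gamma * P @ Q.max(axis=1)

def sample_next(n):
    # Generative model: n i.i.d. next-state samples for every (s, a) pair.
    return np.array([[rng.choice(S, size=n, p=P[s, a]) for a in range(A)]
                     for s in range(S)])     # shape (S, A, n)

def empirical_bellman(Q, ns):
    # Monte Carlo Bellman operator built from sampled next states ns.
    return R + gamma * Q.max(axis=1)[ns].mean(axis=2)

Q = np.zeros((S, A))
for epoch in range(6):
    Qbar = Q.copy()
    # Recentering: high-accuracy Monte Carlo estimate of T(Qbar) once per epoch.
    T_bar = empirical_bellman(Qbar, sample_next(2000))
    for k in range(1, 501):
        ns = sample_next(1)                  # one fresh sample per (s, a)
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)  # rescaled linear step size
        # Variance-reduced update: the single-sample terms at Q and Qbar are
        # correlated, so their difference has small variance near convergence.
        Q = (1 - lam) * Q + lam * (empirical_bellman(Q, ns)
                                   - empirical_bellman(Qbar, ns) + T_bar)

Qstar_hat = Q  # approximate optimal Q-function
```

The recentering term is what buys the improved sample complexity: the one-sample noise of the plain update is replaced by the noise of a difference that shrinks as Q approaches the reference point, mirroring the variance-reduction mechanism the abstract credits for the cubic (rather than quartic) discount-complexity scaling.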
