
Q-learning with Logarithmic Regret

Abstract

This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning studied in [Jin et al. 2018] enjoys a $\mathcal{O}\left(\frac{SA\cdot\mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S, A, T$ up to a $\log(SA)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
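To make the algorithm behind the bound concrete, below is a minimal sketch of optimistic Q-learning with a UCB-Hoeffding exploration bonus in the spirit of [Jin et al. 2018]. The environment interface (`env.reset` returning an integer state index, `env.step` returning a next state and reward) and the bonus constant `c` are assumptions for illustration, not the paper's exact specification.

```python
# A minimal sketch of optimistic Q-learning (UCB-Hoeffding), following the
# algorithm analyzed in [Jin et al. 2018]. The `env` interface below is a
# hypothetical stand-in for an episodic tabular MDP.

import numpy as np

def optimistic_q_learning(env, S, A, H, K, c=1.0, p=0.05):
    """Run K episodes on an episodic tabular MDP with S states, A actions,
    and horizon H; returns the learned Q-table of shape (H, S, A)."""
    T = K * H
    iota = np.log(S * A * T / p)             # log factor inside the bonus
    Q = np.full((H, S, A), float(H))         # optimistic init: Q_h <= H always
    N = np.zeros((H, S, A), dtype=int)       # visit counts

    for _ in range(K):
        s = env.reset()                      # hypothetical interface
        for h in range(H):
            a = int(np.argmax(Q[h, s]))      # act greedily w.r.t. optimistic Q
            s_next, r = env.step(a)          # hypothetical interface
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)                 # learning rate from Jin et al. 2018
            bonus = c * np.sqrt(H ** 3 * iota / t)    # UCB-Hoeffding exploration bonus
            # Clipped optimistic value of the next state (zero at the horizon).
            v_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            s = s_next
    return Q
```

Note that the algorithm itself is unchanged from [Jin et al. 2018]; the paper's contribution is the gap-dependent analysis showing that, whenever $\Delta_{\min} > 0$, these same updates yield logarithmic rather than $\sqrt{T}$ regret.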
