Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Abstract
In this paper we consider the problem of learning an $\varepsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\varepsilon > 0$, we provide a model-free algorithm that learns an $\varepsilon$-optimal policy with sample complexity $\tilde{O}\!\left(\frac{SA\ln(1/p)}{\varepsilon^{2}(1-\gamma)^{5.5}}\right)$ (where the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $\frac{1}{1-\gamma}$, and $\frac{1}{\varepsilon}$) and success probability $1-p$. For small enough $\varepsilon$, we show an improved algorithm with sample complexity $\tilde{O}\!\left(\frac{SA\ln(1/p)}{\varepsilon^{2}(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and upon model-based algorithms with tight dependence on $S$, our second algorithm beats all known sample-complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
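To make the final claim concrete, the following math block is a sketch of ours rather than text from the paper: it assumes the standard minimax lower bound of order $SA/(\varepsilon^{2}(1-\gamma)^{3})$ for learning an $\varepsilon$-optimal policy in a discounted MDP, and places the second sample-complexity bound next to it to show that the two agree up to poly-logarithmic factors.

```latex
% Hedged sketch (not taken from the paper's text): the improved upper bound
% compared against the assumed minimax lower bound for epsilon-optimal
% policy learning in a discounted MDP.
\[
  \underbrace{\tilde{O}\!\left(\frac{SA\,\ln(1/p)}{\varepsilon^{2}(1-\gamma)^{3}}\right)}_{\text{second algorithm (small enough } \varepsilon\text{)}}
  \quad\text{vs.}\quad
  \underbrace{\Omega\!\left(\frac{SA}{\varepsilon^{2}(1-\gamma)^{3}}\right)}_{\text{information-theoretic lower bound}}
\]
% Both scale as SA / (eps^2 (1-gamma)^3); the upper bound exceeds the lower
% bound only by poly-logarithmic factors in S, A, 1/(1-gamma), 1/eps and ln(1/p).
```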