Minimax Optimal Reinforcement Learning for Discounted MDPs

Abstract
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the optimism in the face of uncertainty principle and the Bernstein-type bonus. It achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor, and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
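
To make the algorithmic idea concrete, the following is a minimal Python sketch of optimistic value iteration with a Bernstein-type exploration bonus for a discounted tabular MDP. It is an illustrative sketch, not the paper's UCBVI-$\gamma$: the function name `optimistic_value_iteration`, its arguments, the simplified bonus form, and all constants are assumptions made here for illustration.

```python
# Illustrative sketch (not the authors' exact UCBVI-gamma): optimistic value
# iteration for a discounted tabular MDP with an empirical transition model and
# a simplified Bernstein-type exploration bonus.
import numpy as np

def optimistic_value_iteration(counts, rewards, gamma, t, delta=0.05, iters=200):
    """counts[s, a, s']: visit counts; rewards[s, a]: known rewards in [0, 1]."""
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)           # visits to each (s, a)
    p_hat = counts / n[:, :, None]                  # empirical transition model
    log_term = np.log(S * A * max(t, 2) / delta)
    v_max = 1.0 / (1.0 - gamma)                     # value-function range
    q = np.full((S, A), v_max)                      # optimistic initialization
    for _ in range(iters):
        v = q.max(axis=1)                           # greedy value estimate
        ev = p_hat @ v                              # E_hat[V(s') | s, a]
        var = p_hat @ (v ** 2) - ev ** 2            # Var_hat[V(s') | s, a]
        # Bernstein-type bonus: variance-dependent term plus a lower-order term
        # (a simplified stand-in for the paper's exact bonus).
        bonus = np.sqrt(2.0 * var * log_term / n) + v_max * log_term / n
        q = np.minimum(rewards + bonus + gamma * ev, v_max)
    return q

# Tiny usage example on a random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
counts = rng.integers(1, 20, size=(3, 2, 3)).astype(float)
rewards = rng.random((3, 2))
print(optimistic_value_iteration(counts, rewards, gamma=0.9, t=100))
```

The variance-dependent bonus is what distinguishes a Bernstein-type construction from a Hoeffding-type one: when the next-state value has low variance under the empirical model, the bonus shrinks, which is one ingredient behind the tighter $(1-\gamma)^{-1.5}$ dependence in the regret bound.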