
Posterior sampling for reinforcement learning: worst-case regret bounds

Abstract

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T \ge S^5 A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy, over time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
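For concreteness, the regret quantity bounded above can be written out under the standard definition for this setting; this is a minimal sketch, and the symbol $\lambda^{*}$ for the optimal average reward is a notational assumption not fixed in the abstract itself:

\[
\mathrm{Reg}(T) \;=\; T\,\lambda^{*} \;-\; \sum_{t=1}^{T} r_t,
\]

where $\lambda^{*}$ denotes the optimal infinite-horizon undiscounted average reward of the true MDP and $r_t$ is the reward collected by the algorithm at step $t$; the $\tilde{O}(D\sqrt{SAT})$ bound is stated to hold with high probability for this quantity.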
