
Posterior sampling for reinforcement learning: worst-case regret bounds

Abstract

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(DS\sqrt{AT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
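For intuition, here is a minimal sketch of generic posterior sampling for a finite MDP: maintain a Dirichlet posterior over the transition probabilities of each state-action pair, sample a plausible MDP from the posterior at the start of each episode, act according to a policy computed for the sampled MDP, and update the posterior with the observed transitions. The `env_step` interface and the `solve_avg_reward` planner below are hypothetical placeholders, and this skeleton omits the refinements (multiple posterior samples and the anti-concentration analysis) that the paper uses to obtain the stated worst-case bound; it illustrates the general posterior-sampling template only.

```python
import numpy as np

def solve_avg_reward(P, R, n_iter=200):
    """Relative value iteration: approximate an average-reward-optimal policy
    for an MDP with transitions P[s, a, s'] and mean rewards R[s, a]."""
    S, A = R.shape
    h = np.zeros(S)
    for _ in range(n_iter):
        Q = R + P @ h            # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * h[s']
        v = Q.max(axis=1)
        h = v - v[0]             # subtract a reference state to keep values bounded
    return Q.argmax(axis=1)

def posterior_sampling_rl(env_step, S, A, horizon, episode_len=50, seed=0):
    """Generic posterior-sampling (Thompson sampling) loop for a finite MDP.

    `env_step(s, a) -> (reward, next_state)` is a hypothetical environment
    interface; rewards are assumed to lie in [0, 1]."""
    rng = np.random.default_rng(seed)
    trans_counts = np.ones((S, A, S))     # Dirichlet(1, ..., 1) prior per (s, a)
    reward_sum = np.zeros((S, A))
    visit_counts = np.zeros((S, A))

    s, t, total_reward = 0, 0, 0.0
    while t < horizon:
        # Sample one plausible MDP from the posterior over transition probabilities.
        P = np.array([[rng.dirichlet(trans_counts[si, ai]) for ai in range(A)]
                      for si in range(S)])
        R = reward_sum / np.maximum(visit_counts, 1.0)   # empirical mean rewards
        policy = solve_avg_reward(P, R)
        # Follow the sampled MDP's policy for one episode, then resample.
        for _ in range(min(episode_len, horizon - t)):
            a = policy[s]
            r, s_next = env_step(s, a)
            total_reward += r
            trans_counts[s, a, s_next] += 1              # posterior update
            reward_sum[s, a] += r
            visit_counts[s, a] += 1
            s, t = s_next, t + 1
    return total_reward
```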
