
Posterior sampling for reinforcement learning: worst-case regret bounds

Abstract

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(DS\sqrt{AT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
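For intuition, here is a minimal sketch of generic posterior sampling for a finite MDP: maintain a Dirichlet posterior over the transition probabilities of each state-action pair, sample a plausible MDP from the posterior at the start of each episode, act according to a policy computed for the sampled MDP, and update the posterior with the observed transitions. The `env_step` interface and the `solve_avg_reward` planner below are hypothetical placeholders, and this skeleton omits the refinements (multiple posterior samples and the anti-concentration analysis) that the paper uses to obtain the stated worst-case bound; it illustrates the general posterior-sampling template only.

```python
import numpy as np

def solve_avg_reward(P, R, n_iter=200):
    """Relative value iteration: approximate an average-reward-optimal policy
    for an MDP with transitions P[s, a, s'] and mean rewards R[s, a]."""
    S, A = R.shape
    h = np.zeros(S)
    for _ in range(n_iter):
        Q = R + P @ h            # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * h[s']
        v = Q.max(axis=1)
        h = v - v[0]             # subtract a reference state to keep values bounded
    return Q.argmax(axis=1)

def posterior_sampling_rl(env_step, S, A, horizon, episode_len=50, seed=0):
    """Generic posterior-sampling (Thompson sampling) loop for a finite MDP.

    `env_step(s, a) -> (reward, next_state)` is a hypothetical environment
    interface; rewards are assumed to lie in [0, 1]."""
    rng = np.random.default_rng(seed)
    trans_counts = np.ones((S, A, S))     # Dirichlet(1, ..., 1) prior per (s, a)
    reward_sum = np.zeros((S, A))
    visit_counts = np.zeros((S, A))

    s, t, total_reward = 0, 0, 0.0
    while t < horizon:
        # Sample one plausible MDP from the posterior over transition probabilities.
        P = np.array([[rng.dirichlet(trans_counts[si, ai]) for ai in range(A)]
                      for si in range(S)])
        R = reward_sum / np.maximum(visit_counts, 1.0)   # empirical mean rewards
        policy = solve_avg_reward(P, R)
        # Follow the sampled MDP's policy for one episode, then resample.
        for _ in range(min(episode_len, horizon - t)):
            a = policy[s]
            r, s_next = env_step(s, a)
            total_reward += r
            trans_counts[s, a, s_next] += 1              # posterior update
            reward_sum[s, a] += r
            visit_counts[s, a] += 1
            s, t = s_next, t + 1
    return total_reward
```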
