A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent

Abstract
In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $O(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions, and $T$ is the horizon. We consider the online setting, where the opponent cannot be controlled and may follow an arbitrary time-adaptive, history-dependent strategy. Our regret bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.