We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa \ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). Ménard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Cappé et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits.
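The abstract only states that MOSS and KL-UCB are combined; as a rough illustration (not the paper's algorithm), the following minimal Python sketch runs an index strategy that uses a KL-UCB-style index for lightly sampled arms and switches to the MOSS index afterwards. The Bernoulli kl divergence is a stand-in for the nonparametric $\mathcal{K}_{\inf}$ of the model of all distributions over $[0,1]$, and the switch threshold `(T/K)**0.2` and the exploration level $\ln_+\!\big(T/(K N_a)\big)$ are assumptions of the sketch, not the paper's exact choices.

```python
import math
import random

def bernoulli_kl(p, q):
    """kl(p, q) between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, level):
    """Largest q in [mean, 1] with n * kl(mean, q) <= level, via bisection."""
    lo, hi = mean, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        if n * bernoulli_kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

def moss_index(mean, n, T, K):
    """MOSS index: mean + sqrt(max(ln(T / (K n)), 0) / n)."""
    return mean + math.sqrt(max(math.log(T / (K * n)), 0.0) / n)

def run(means, T, seed=0):
    """Play T rounds on Bernoulli arms, switching indices per arm
    once its pull count exceeds (T/K)**0.2 (assumed threshold)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    threshold = (T / K) ** 0.2  # assumed switch point for this sketch
    for t in range(T):
        if t < K:
            a = t  # pull each arm once to initialize
        else:
            indices = []
            for k in range(K):
                mu = sums[k] / counts[k]
                level = max(math.log(T / (K * counts[k])), 0.0)
                if counts[k] <= threshold:
                    indices.append(kl_ucb_index(mu, counts[k], level))
                else:
                    indices.append(moss_index(mu, counts[k], T, K))
            a = max(range(K), key=indices.__getitem__)
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
    return counts  # pull counts; the best arm should dominate

if __name__ == "__main__":
    print(run([0.5, 0.6], T=10_000))
```

The design intent of such a switch is visible in the sketch: the KL-UCB index is the sharper one for the distribution-dependent (asymptotic) regime, while the MOSS index controls the distribution-free $\sqrt{KT}$ regime; how and when to switch between them is precisely what the paper's analysis settles.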