
Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem

Abstract

Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$ and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the $k$th time it is sampled. It is assumed that for each fixed $i$, $\{ X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. normal random variables with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. The objective is to have a policy $\pi$ for deciding, at each time $n = 1, 2, \ldots$, from which of the $N$ populations to sample, so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information about the parameters $\mu_i$ and $\sigma_i^2$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are given.
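
The abstract does not spell out the exact form of the ISM index, so the following is a minimal Python sketch of an index policy of this flavor: each arm's index is its sample mean inflated by a term proportional to its sample standard deviation, and at each step the arm with the largest index is sampled. The specific inflation term `std * sqrt(n**(2/(t_i - 2)) - 1)`, the three initial pulls per arm, and the regret bookkeeping are illustrative assumptions, not quoted from the paper.

```python
import numpy as np

def ism_index(mean, std, t_i, n):
    """Inflated sample mean index for one arm.

    Assumed inflation term: std * sqrt(n^{2/(t_i - 2)} - 1), which needs
    t_i >= 3 pulls of the arm; the paper's exact index may differ.
    """
    return mean + std * np.sqrt(n ** (2.0 / (t_i - 2)) - 1.0)

def run_ism_policy(mus, sigmas, horizon, seed=None):
    """Simulate an ISM-style index policy on N normal arms with unknown (mu, sigma)."""
    rng = np.random.default_rng(seed)
    n_arms = len(mus)
    samples = [[] for _ in range(n_arms)]

    def pull(i):
        samples[i].append(rng.normal(mus[i], sigmas[i]))

    # Initialization: a few pulls per arm so the sample mean, sample standard
    # deviation, and the inflation exponent are all well defined.
    for i in range(n_arms):
        for _ in range(3):
            pull(i)

    n = 3 * n_arms
    while n < horizon:
        n += 1
        indices = []
        for i in range(n_arms):
            x = np.asarray(samples[i])
            indices.append(ism_index(x.mean(), x.std(), len(x), n))
        pull(int(np.argmax(indices)))

    # Single-run pseudo-regret: shortfall relative to always sampling the best arm.
    best = max(mus)
    return sum(len(s) * (best - mu) for s, mu in zip(samples, mus))
```

For example, `run_ism_policy([0.0, 0.5], [1.0, 2.0], horizon=10_000, seed=0)` returns the pseudo-regret of one simulated run with two arms.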
