Toward Better Use of Data in Linear Bandits
In this paper, we study the well-known stochastic linear bandit problem, in which a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a horizon of length . We first introduce a general analysis framework and a family of rate-optimal algorithms for this problem. We show that this family includes well-known algorithms such as optimism in the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS) as special cases. The proposed analysis technique directly captures the complexity of uncertainty in the action sets, which we show is tied to the regret analysis of any policy. This insight allows us to design a new rate-optimal policy, called Sieved-Greedy (SG), that reduces the over-exploration problem in existing algorithms. SG utilizes data to discard the actions with relatively low uncertainty and then chooses greedily among the remaining actions. In addition to proving that SG is theoretically rate-optimal, our empirical simulations show that it significantly outperforms existing benchmarks such as greedy, OFUL, and TS. Moreover, our analysis technique yields a number of new results, such as poly-logarithmic (in ) regret bounds for OFUL and TS under a generalized gap assumption and a margin condition, as in the literature on contextual bandits. We also improve the regret bounds of these algorithms for the sub-class of -armed contextual bandit problems by a factor .
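To make the Sieved-Greedy idea concrete, here is a minimal sketch of one plausible implementation: at each round, a ridge estimate of the reward parameter is maintained, per-action uncertainty is measured by the familiar norm induced by the inverse Gram matrix, actions whose uncertainty is small relative to the maximum are sieved out, and the remaining actions are ranked greedily by estimated reward. The specific sieving rule, the `threshold` level, and the ridge update below are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 20, 500
theta_star = rng.normal(size=d)       # unknown true parameter (simulation only)
actions = rng.normal(size=(K, d))     # fixed finite action set

V = np.eye(d)                         # regularized Gram matrix
b = np.zeros(d)
threshold = 0.5                       # sieve level in (0, 1]; a tunable assumption

for t in range(T):
    theta_hat = np.linalg.solve(V, b)         # ridge estimate of theta_star
    V_inv = np.linalg.inv(V)
    # per-action uncertainty: ||x||_{V^{-1}} for each action x
    unc = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    # sieve: discard actions whose uncertainty is low relative to the max
    keep = unc >= threshold * unc.max()
    # greedy choice among the surviving actions
    est = actions @ theta_hat
    est[~keep] = -np.inf
    i = int(np.argmax(est))
    x = actions[i]
    r = x @ theta_star + rng.normal()         # noisy reward
    V += np.outer(x, x)                       # rank-one Gram update
    b += r * x
```

As the Gram matrix grows, uncertainties shrink and the sieve retains fewer actions, so the policy behaves increasingly greedily, which is the intended reduction of over-exploration relative to always-optimistic rules.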