A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown -valued stationary -mixing sequence of parameters which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with iid noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the -dependence between consecutive . An optimistic algorithm, called LinMix-UCB, is proposed for the case where has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of with respect to an oracle that always plays a multiple of . The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of .
View on arXiv