Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). For a given integer $M \le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of a large number of tasks and a small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\widetilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\widetilde{O}(N\sqrt{M\tau} + N^{2/3}\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved regret.
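To make the two-level structure concrete, below is a minimal Python sketch of the setting: a meta-level routine selects a size-$M$ subset of arms at each task boundary, and a base bandit algorithm (UCB1 here) plays within that subset for the duration of the task. All names and parameter values ($K$, $M$, $N$, $\tau$, `run_ucb`, the epsilon-greedy subset rule) are illustrative assumptions; in particular, the epsilon-greedy top-$M$ selection is a crude stand-in for the paper's reduction to bandit submodular maximization, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes (illustrative, not from the paper):
K, M, N, tau = 10, 2, 200, 300  # arms, subset size, number of tasks, task length

# Tasks share a small latent set of M "good" arms: in each task, one of them
# is optimal, matching the small-set-of-optimal-arms regime of the abstract.
good_arms = rng.choice(K, size=M, replace=False)

def sample_task_means():
    means = rng.uniform(0.0, 0.4, size=K)
    means[rng.choice(good_arms)] = 0.9  # one good arm is optimal this task
    return means

def run_ucb(subset, means, horizon):
    """UCB1 restricted to `subset` for `horizon` rounds; returns total reward."""
    counts = np.zeros(len(subset))
    sums = np.zeros(len(subset))
    total = 0.0
    for t in range(horizon):
        if t < len(subset):
            i = t  # initialize: pull each arm in the subset once
        else:
            i = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t + 1) / counts)))
        r = float(rng.random() < means[subset[i]])  # Bernoulli reward
        counts[i] += 1.0
        sums[i] += r
        total += r
    return total

# Meta level: estimate each arm's average per-task value when it was in the
# played subset, and pick the top-M arms with epsilon-greedy exploration
# (a crude stand-in for the bandit submodular maximization reduction).
value = np.zeros(K)
plays = np.zeros(K)
eps = 0.2
for task in range(N):
    means = sample_task_means()
    if rng.random() < eps or plays.min() == 0:
        subset = rng.choice(K, size=M, replace=False)  # explore subsets
    else:
        subset = np.argsort(-value / plays)[:M]  # exploit estimated top-M
    reward = run_ucb(list(subset), means, tau) / tau  # normalized task reward
    value[subset] += reward
    plays[subset] += 1

print("latent good arms:", sorted(good_arms))
print("estimated top-M :", sorted(np.argsort(-value / np.maximum(plays, 1))[:M]))
```

Note that the meta level in this sketch observes only bandit feedback, namely the normalized reward of the subset it chose for that task, which is the same information constraint under which a bandit submodular maximization reduction would have to operate.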