Towards Fundamental Limits of Multi-armed Bandits with Random Walk Feedback
In this paper, we consider a new Multi-Armed Bandit (MAB) problem where arms are nodes in an unknown and possibly changing graph, and the agent (i) initiates random walks over the graph by pulling arms, (ii) observes the random walk trajectories, and (iii) receives rewards equal to the lengths of the walks. We provide a comprehensive understanding of this problem by studying both the stochastic and the adversarial settings. In the stochastic setting, we show that this problem is no easier than a standard MAB, even though additional information is available through the random walk trajectories. In the adversarial setting, we show that an extension of the exponential weight algorithm can achieve a regret bound of order $\widetilde{\mathcal{O}}\left(\sqrt{\kappa T}\right)$, where $\kappa$ is a constant that depends on the structure of the graph rather than on the number of arms.
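The feedback model described above can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration: the transition structure, the Exp3-style learner, and all parameter values are hypothetical and not taken from the paper; in particular, the paper's actual algorithm exploits the full trajectory, whereas this sketch only uses the walk length as the reward.

```python
import math
import random

# Hypothetical toy environment: arms are starting nodes of a graph; pulling
# an arm runs a random walk until absorption, and the reward is the length
# of the walk (the trajectory length).
def random_walk_length(transitions, start, absorbing, rng, max_steps=10_000):
    """Run one random walk from `start`; return its length (the reward)."""
    node, steps = start, 0
    while node != absorbing and steps < max_steps:
        neighbors, probs = transitions[node]
        node = rng.choices(neighbors, weights=probs)[0]
        steps += 1
    return steps

def exp3(arms, pull, T, eta, rng, reward_cap):
    """Exp3-style exponential weight learner (a standard adversarial-MAB
    scheme, used here as a stand-in for the paper's extension).
    Rewards are clipped to [0, reward_cap] and rescaled into [0, 1]."""
    weights = [1.0] * len(arms)
    total = 0.0
    for _ in range(T):
        s = sum(weights)
        probs = [w / s for w in weights]
        i = rng.choices(range(len(arms)), weights=probs)[0]
        r = min(pull(arms[i]), reward_cap) / reward_cap
        total += r
        est = r / probs[i]  # importance-weighted reward estimate
        weights[i] *= math.exp(eta * est)
    return total

# Toy 4-node graph: node 3 is absorbing; arms are starting nodes 0, 1, 2.
rng = random.Random(0)
transitions = {
    0: ([1, 3], [0.5, 0.5]),
    1: ([0, 2, 3], [0.3, 0.3, 0.4]),
    2: ([1, 3], [0.2, 0.8]),
}
pull = lambda a: random_walk_length(transitions, a, 3, rng)
gain = exp3([0, 1, 2], pull, T=200, eta=0.05, rng=rng, reward_cap=20)
```

The sketch only conveys the interaction protocol: each pull yields a full random walk whose length is the reward, which is the sense in which the feedback is richer than a standard bandit's single scalar observation.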