
Towards Fundamental Limits of Multi-armed Bandits with Random Walk Feedback

Abstract

Despite the ubiquitous applications of bandit learning algorithms in recommendation systems, social networks, and online advertisement, where user behaviors can be modeled as a random walk over a network, few studies have utilized the network structure to improve learning efficiency. In this paper, we address this issue by providing a novel bandit learning formulation, where each arm is the starting node of a random walk in a network and the reward is the length of the walk. This formulation not only captures a large number of applications in practice but also provides a framework for actively reducing learning complexity by utilizing the graph structure in the random walk feedback. We provide a comprehensive understanding of this formulation by studying both the stochastic and the adversarial setting. In the stochastic setting, we observe that there exists a difficult problem instance on which the following two seemingly conflicting facts hold simultaneously: 1. information-theoretically, no algorithm can achieve a regret bound independent of problem intrinsics; 2. there exists an algorithm whose performance is independent of problem intrinsics in terms of the tail of mistakes. This reveals an intriguing phenomenon in general semi-bandit feedback learning problems. In the adversarial setting, we establish a novel algorithm that achieves a regret bound of order $\widetilde{\mathcal{O}} \left( \sqrt{\kappa T} \right)$, where $\kappa$ is a constant that depends on the structure of the graph rather than on the number of arms (nodes). This bound significantly improves over regular bandit algorithms, whose complexity depends on the number of arms (nodes).
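To make the formulation concrete, here is a minimal sketch of the feedback model described above: each arm is a starting node, pulling it runs a random walk until an absorbing node is hit, and the observed reward is the walk length. The class name `RandomWalkBandit`, the specific transition matrix, and the truncation parameter are illustrative assumptions, not the authors' code.

```python
import numpy as np

class RandomWalkBandit:
    """Sketch of the paper's setting: arms are starting nodes of random walks."""

    def __init__(self, P, absorbing, rng=None):
        self.P = np.asarray(P)            # row-stochastic transition matrix
        self.absorbing = set(absorbing)   # nodes where the walk terminates
        self.rng = rng or np.random.default_rng()

    def pull(self, arm, max_steps=10_000):
        """Start a walk at `arm`; return (walk length, full trajectory).

        The trajectory is the semi-bandit feedback: it reveals transitions
        at every node visited, not just at the chosen arm, which is how the
        graph structure can be exploited to reduce learning complexity.
        `max_steps` is an assumed truncation to keep the sketch runnable.
        """
        node, path = arm, [arm]
        while node not in self.absorbing and len(path) <= max_steps:
            node = self.rng.choice(len(self.P), p=self.P[node])
            path.append(node)
        return len(path) - 1, path

# Toy example: 3 transient nodes (the arms) and one absorbing node (node 3).
P = [[0.0, 0.5, 0.3, 0.2],
     [0.2, 0.0, 0.3, 0.5],
     [0.1, 0.1, 0.0, 0.8],
     [0.0, 0.0, 0.0, 1.0]]
env = RandomWalkBandit(P, absorbing=[3])
length, path = env.pull(arm=0)
print(length, path)
```

Note how a single pull can reveal transition behavior at many nodes along the trajectory; this shared information across arms is what allows the dependence on the number of arms to be replaced by the graph-dependent constant $\kappa$.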
