Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this setting, which combines a recent "convex combination trick" with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one, SFD-TINF, uses the $\frac{1}{2}$-Tsallis entropy regularizer and achieves $\widetilde{\mathcal{O}}(\sqrt{K(D+T)}\,L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega\big((\sqrt{KT}+\sqrt{D\log K}\,)L\big)$ lower bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound that scales with the realized losses rather than the worst-case magnitude alone; it falls back to the $\widetilde{\mathcal{O}}(\sqrt{K(D+T)}\,L)$ regret in the worst case and is thus more general than SFD-TINF, at the price of a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for the non-delayed (i.e., $D=0$) scale-free adversarial MAB problem, which can be of independent interest.
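To see why the $\widetilde{\mathcal{O}}(\sqrt{K(D+T)}\,L)$ upper bound nearly matches the lower bound when $K$ is treated as a constant, here is a minimal worked comparison, assuming the bound forms stated above; it uses only the elementary inequality $\sqrt{a}+\sqrt{b} \le \sqrt{2(a+b)} \le \sqrt{2}\,(\sqrt{a}+\sqrt{b})$ for $a, b \ge 0$:

```latex
% Set a = KT and b = KD; then \sqrt{K(D+T)} is within a constant factor
% of \sqrt{KT} + \sqrt{KD}:
\[
  \sqrt{KT} + \sqrt{KD}
  \;\le\; \sqrt{2}\,\sqrt{K(D+T)}
  \;\le\; 2\,\big(\sqrt{KT} + \sqrt{KD}\big).
\]
% For constant K, the upper bound \sqrt{K(D+T)}\,L and the lower bound
% (\sqrt{KT} + \sqrt{D \log K})\,L are therefore both of order
% (\sqrt{T} + \sqrt{D})\,L, i.e., they match up to K-dependent
% and logarithmic factors.
```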
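The abstract describes the "doubling and skipping" technique only at a high level. The sketch below is a hypothetical illustration of a generic doubling trick for an unknown loss scale, not the authors' algorithm: the learner maintains a scale estimate, doubles it whenever an observed loss overflows it, and skips (excludes from its update) the offending round. All names, the placeholder uniform policy, and the overflow rule are illustrative assumptions.

```python
import random


def doubling_scale_bandit(loss_stream, num_arms, horizon):
    """Generic doubling trick for an unknown loss scale (illustrative only).

    Maintains a running scale estimate `scale`; rounds whose observed loss
    exceeds it are "skipped" (no update) and the estimate is doubled.
    The base learner is a trivial uniform sampler as a placeholder.
    """
    scale = 1.0                      # current guess of the loss magnitude L
    totals = [0.0] * num_arms        # cumulative observed losses per arm
    skipped = 0

    for t in range(horizon):
        arm = random.randrange(num_arms)   # placeholder policy (uniform)
        loss = loss_stream(t, arm)         # loss of the pulled arm is revealed

        if abs(loss) > scale:
            # Scale overflow: double the estimate until the loss fits,
            # and skip this round's update entirely.
            while abs(loss) > scale:
                scale *= 2.0
            skipped += 1
            continue

        totals[arm] += loss                # normal update at the current scale

    return scale, totals, skipped


# Usage: losses drawn from [-5, 5], a range unknown to the learner beforehand.
final_scale, _, n_skipped = doubling_scale_bandit(
    lambda t, arm: random.uniform(-5.0, 5.0), num_arms=3, horizon=1000
)
print(f"final scale estimate: {final_scale}, skipped rounds: {n_skipped}")
```

Because each overflow at least doubles the estimate, the number of skipped rounds is logarithmic in the true scale, which is the usual reason doubling-style schemes add only lower-order terms to the regret.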