We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde{O}(\sqrt{KdT})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde{O}(K\sqrt{dV_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde{O}(K\sqrt{dL_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these bounds improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.