An Optimal High Probability Algorithm for the Contextual Bandit Problem

International Conference on Artificial Intelligence and Statistics (AISTATS), 2010

22 February 2010

Abstract

We consider the problem of learning to predict with expert advice in an adversarial, on-line bandit setting. We study how to behave in a way that achieves nearly as much reward as the best expert with high probability, rather than only in expectation. We present Exp4.P, an algorithm for solving this contextual bandit problem. We prove that Exp4.P competes with any set of policies or experts of size $N$ while incurring regret at most $O(\sqrt{KT\ln(N/\delta)})$ with probability at least $1-\delta$, where $K$ is the number of actions and $T$ is the number of rounds of interaction. This guarantee improves on all previous algorithms for this problem, whether in a stochastic or adversarial setting. We also test the new algorithm experimentally.
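To get a feel for how the stated bound scales, the quantity inside the $O(\cdot)$ can be evaluated directly. The sketch below is illustrative only: the function name `regret_bound_scale` and the problem sizes are hypothetical, and the constant hidden by the $O(\cdot)$ notation is taken to be 1, which is an assumption and not a value given in the abstract.

```python
import math


def regret_bound_scale(num_actions, num_rounds, num_experts, delta):
    """Return sqrt(K * T * ln(N / delta)).

    This is the expression inside the O(.) of the stated guarantee.
    The hidden constant factor is assumed to be 1 for illustration.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if min(num_actions, num_rounds, num_experts) < 1:
        raise ValueError("K, T, and N must all be at least 1")
    return math.sqrt(num_actions * num_rounds * math.log(num_experts / delta))


# Hypothetical problem sizes, chosen only for illustration.
K, T, N, delta = 10, 1_000_000, 1_000, 0.05

scale = regret_bound_scale(K, T, N, delta)
print(f"regret scale over T rounds: {scale:.0f}")
print(f"average per-round regret scale: {scale / T:.5f}")

# Quadrupling the horizon doubles the scale, since it grows as sqrt(T).
longer = regret_bound_scale(K, 4 * T, N, delta)
print(f"ratio after quadrupling T: {longer / scale:.3f}")
```

Under these assumptions the total regret scale grows as $\sqrt{T}$, so the average per-round regret shrinks as $1/\sqrt{T}$, while the number of experts $N$ and the confidence parameter $\delta$ enter only through a logarithm.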
