An Optimal High Probability Algorithm for the Contextual Bandit Problem
International Conference on Artificial Intelligence and Statistics (AISTATS), 2010
Abstract
We consider the problem of learning to predict with expert advice in an adversarial, online bandit setting. We study how to behave in a way that achieves nearly as much reward as the best expert with high probability, rather than in expectation. We provide the algorithm Exp4.P for solving this contextual bandit problem. We prove that Exp4.P competes with any set of policies or experts of size $N$ while incurring regret at most $O(\sqrt{KT \ln(N/\delta)})$ with probability $1-\delta$, where $K$ is the number of actions and $T$ is the number of rounds of interaction. This guarantee improves on all previous algorithms for this problem, whether in a stochastic or adversarial setting. We also test the new algorithm experimentally.
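The abstract does not spell out the update rule, so the following is only a minimal sketch of an Exp4.P-style loop as the algorithm is commonly presented, not the authors' reference implementation. It assumes N >= 2 experts, K actions, T rounds known in advance, failure probability delta, and rewards in [0, 1]. The names `exp4p`, `advice_fn`, and `reward_fn` are hypothetical, introduced here for illustration.

```python
import math
import random


def exp4p(advice_fn, reward_fn, n_experts, n_actions, n_rounds, delta=0.05, seed=0):
    """Sketch of an Exp4.P-style loop (assumes n_experts >= 2, rewards in [0, 1]).

    advice_fn(t) -> list of N probability vectors over the K actions (hypothetical).
    reward_fn(t, a) -> reward of the chosen action a at round t (hypothetical).
    Returns (total_reward, final_weights), weights normalized so the max is 1.
    """
    rng = random.Random(seed)
    N, K, T = n_experts, n_actions, n_rounds
    # Minimum exploration probability, capped so that K * p_min <= 1.
    p_min = min(1.0 / K, math.sqrt(math.log(N) / (K * T)))
    # Scale of the confidence term added to each expert's reward estimate.
    bonus = math.sqrt(math.log(N / delta) / (K * T))
    log_w = [0.0] * N  # log-weights, kept in log space for numerical stability
    total = 0.0
    for t in range(T):
        advice = advice_fn(t)
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        W = sum(w)
        # Mix the experts' advice by weight, then add uniform exploration.
        p = [
            (1 - K * p_min) * sum(w[i] * advice[i][j] for i in range(N)) / W + p_min
            for j in range(K)
        ]
        a = rng.choices(range(K), weights=p)[0]
        r = reward_fn(t, a)  # bandit feedback: only the chosen action's reward
        total += r
        for i in range(N):
            # Importance-weighted estimate of expert i's reward this round.
            y_hat = advice[i][a] * r / p[a]
            # Variance proxy that drives the high-probability confidence term.
            v_hat = sum(advice[i][j] / p[j] for j in range(K))
            log_w[i] += (p_min / 2) * (y_hat + v_hat * bonus)
    m = max(log_w)
    return total, [math.exp(lw - m) for lw in log_w]


# Toy usage: expert 0 always recommends action 0 (reward 1),
# expert 1 always recommends action 1 (reward 0).
advice = [[1.0, 0.0], [0.0, 1.0]]
total, weights = exp4p(
    advice_fn=lambda t: advice,
    reward_fn=lambda t, a: 1.0 if a == 0 else 0.0,
    n_experts=2, n_actions=2, n_rounds=2000,
)
```

The confidence term (`v_hat * bonus`) is what separates this sketch from a plain Exp4-style update: it inflates each expert's estimate by an amount tied to the variance of the importance-weighted rewards, which is the usual route to a bound that holds with high probability rather than only in expectation.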
