An Optimal High Probability Algorithm for the Contextual Bandit Problem

International Conference on Artificial Intelligence and Statistics (AISTATS), 2010
Abstract

We consider the problem of learning to predict with expert advice in an adversarial, on-line bandit setting. We study how to behave in a way that achieves nearly as much reward as the best expert with high probability, rather than in expectation. We provide the algorithm Exp4.P for solving this contextual bandit problem. We prove that Exp4.P competes with any set of policies or experts of size $N$ while incurring regret at most $O(\sqrt{KT\ln(N/\delta)})$ with probability $1-\delta$, where $K$ is the number of actions and $T$ is the number of rounds of interaction. This guarantee improves on all previous algorithms for this problem, whether in a stochastic or adversarial setting. We also test the new algorithm experimentally.
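To get a feel for how the stated guarantee scales, the quantity inside the big-O can be evaluated numerically. The sketch below is illustrative only: the function name `regret_bound_order` is hypothetical, and the constant factor hidden by the O-notation is dropped, so the returned values indicate growth rates rather than actual regret.

```python
import math


def regret_bound_order(K: int, T: int, N: int, delta: float) -> float:
    """Return sqrt(K * T * ln(N / delta)).

    This is the order of the regret bound quoted in the abstract, with the
    constant factor hidden by the O-notation omitted (an assumption made
    here for illustration).
    K: number of actions, T: number of rounds,
    N: number of experts or policies, delta: allowed failure probability.
    """
    if K < 1 or T < 1 or N < 1:
        raise ValueError("K, T and N must be positive integers")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(K * T * math.log(N / delta))


if __name__ == "__main__":
    # Average regret per round (bound / T) shrinks as T grows.
    for T in (1_000, 10_000, 100_000):
        bound = regret_bound_order(K=10, T=T, N=1_000, delta=0.05)
        print(f"T={T}: bound order {bound:.1f}, per round {bound / T:.4f}")
```

Because the expression grows like $\sqrt{T}$, quadrupling the number of rounds only doubles it, and the dependence on $N$ and $\delta$ is logarithmic, which is why a large set of experts or a small failure probability costs comparatively little.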
