Decision Tree Algorithms for the Contextual Bandit Problem

To address the contextual bandit problem, we propose an online decision tree algorithm. We show that the proposed algorithm, KMD-Tree, incurs an expected cumulative regret of order O(log T) against the greedy decision tree built knowing the joint distribution of contexts and rewards. We show that this problem-dependent regret bound is optimal up to a factor of 1/\Delta in the logarithm. The expected cumulative regret depends logarithmically on the number of contextual variables, and the computational complexity of the algorithm is linear in the time horizon. These properties make KMD-Tree well suited to real applications, where the number of events to process is huge and where some contextual variables, drawn from a large set, are expected to have potentially non-linear dependencies with the rewards. In experiments conducted to illustrate the theoretical analysis, KMD-Tree obtains promising results in comparison with state-of-the-art algorithms.
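To make the comparator concrete, the following minimal sketch shows one plausible reading of a single greedy split when the joint distribution of contexts and rewards is known. It is not the KMD-Tree algorithm itself, which must estimate these quantities online. The function name `best_split_variable`, the binary contexts, and the toy distribution are all assumptions introduced for illustration; the abstract does not specify them.

```python
def best_split_variable(joint, n_vars, n_actions):
    """Pick the context variable whose split yields the highest expected reward.

    `joint` (hypothetical format) maps a tuple of 0/1 context values to
    (probability of that context, list of mean rewards per action).
    Assumption: the greedy criterion scores variable i by
    sum over values v of P(x_i = v) * max_k E[reward_k | x_i = v].
    """
    best_var, best_value = None, float("-inf")
    for i in range(n_vars):
        value = 0.0
        for v in (0, 1):
            # totals[k] = P(x_i = v) * E[reward_k | x_i = v]
            totals = [0.0] * n_actions
            for x, (p, means) in joint.items():
                if x[i] == v:
                    for k in range(n_actions):
                        totals[k] += p * means[k]
            # Play the best action in the leaf x_i = v
            value += max(totals)
        if value > best_value:
            best_var, best_value = i, value
    return best_var, best_value


# Toy distribution (invented): variable 0 determines the best action,
# variable 1 is irrelevant noise.
joint = {
    (0, 0): (0.25, [0.1, 0.9]),
    (0, 1): (0.25, [0.1, 0.9]),
    (1, 0): (0.25, [0.9, 0.1]),
    (1, 1): (0.25, [0.9, 0.1]),
}
var, value = best_split_variable(joint, n_vars=2, n_actions=2)
```

On this toy distribution the split on variable 0 scores about 0.9, while the split on variable 1 scores about 0.5, so the greedy choice is variable 0. Applying the same selection recursively within each leaf would give a deeper greedy tree under the same assumptions.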