
Decision Tree Algorithms for the Contextual Bandit Problem

Abstract

To address the contextual bandit problem, we propose online decision tree algorithms. We show that KMD-Tree incurs an expected cumulated regret on the order of $O(\log T)$ against the greedy decision tree built with knowledge of the joint distribution of contexts and rewards, and that the learning algorithm is optimal up to a factor $1/\Delta$ in the logarithm. The dependence of the expected cumulated regret on the number of contextual variables is logarithmic. The computational complexity of the proposed algorithm is linear in the time horizon. These analytical results make KMD-Tree efficient in real applications, where the number of events to process is huge and where we expect that some contextual variables, chosen from a large set, have potentially non-linear dependencies with the rewards. Finally, the parallel nature of its learning makes it possible to build a randomized collection of KMD-Trees, the KMD-Forest. KMD-Forest is analyzed against a strong reference, the Random Forest built with knowledge of the joint distribution of contexts and rewards. We show that KMD-Forest incurs an expected cumulated regret on the order of $O(\log T)$ and that the proposed algorithm is optimal up to a factor $1/\Delta$ in the logarithm. In experiments done to illustrate the theoretical analysis, KMD-Tree and KMD-Forest obtain promising results in comparison with state-of-the-art algorithms.
