Regret Bounds for Stochastic Combinatorial Multi-Armed Bandits with Linear Space Complexity

Abstract

Many real-world problems require choosing the best $K$ out of $N$ options at a given time instant. This setup can be modelled as a combinatorial bandit that chooses $K$ out of $N$ arms at each time step, with the aim of achieving an efficient tradeoff between exploration and exploitation. This is the first work on combinatorial bandits in which the reward received can be a non-linear function of the chosen $K$ arms. Applying a standard multi-armed bandit directly requires choosing among $\binom{N}{K}$ options, which makes the state space large. In this paper, we present a novel algorithm that is computationally efficient and whose storage is linear in $N$. The proposed algorithm, which we call CMAB-SM, is a divide-and-conquer based strategy. Further, it achieves a regret bound of $\tilde O(K^{\frac{1}{2}} N^{\frac{1}{3}} T^{\frac{2}{3}})$ for a time horizon $T$, which is sub-linear in all parameters $T$, $N$, and $K$. Evaluation results on different reward functions and arm distribution functions show significantly improved performance compared to the standard multi-armed bandit approach with $\binom{N}{K}$ choices.
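
To make the scale of the problem concrete, the short Python sketch below compares the number of choices a standard multi-armed bandit would face, $\binom{N}{K}$, with $N$ itself, and evaluates the polynomial part of the stated regret bound. It is only an illustration of the arithmetic, not the CMAB-SM algorithm: the function names `num_super_arms` and `regret_bound_scale` are hypothetical, and the second function ignores the constants and logarithmic factors hidden by the $\tilde O$ notation.

```python
import math


def num_super_arms(n: int, k: int) -> int:
    """Number of distinct K-subsets of N arms, i.e. the number of
    choices a standard multi-armed bandit would have to treat separately."""
    return math.comb(n, k)


def regret_bound_scale(n: int, k: int, t: int) -> float:
    """Polynomial part K^(1/2) * N^(1/3) * T^(2/3) of the stated bound.

    Assumption: constants and logarithmic factors hidden by the
    tilde-O notation are dropped, so only the growth rate is meaningful.
    """
    return math.sqrt(k) * n ** (1 / 3) * t ** (2 / 3)


if __name__ == "__main__":
    # Illustrative (N, K) pairs; these values are not taken from the paper.
    for n, k in [(10, 3), (20, 5), (50, 10)]:
        print(f"N={n:3d} K={k:3d}  N-choose-K={num_super_arms(n, k):,}  N={n}")

    # Sub-linearity in T: multiplying T by 8 multiplies the scale by 8^(2/3) = 4.
    base = regret_bound_scale(n=20, k=5, t=1_000)
    longer = regret_bound_scale(n=20, k=5, t=8_000)
    print(f"ratio of bound scales for 8x horizon: {longer / base:.3f}")
```

The count of $K$-subsets grows combinatorially with $N$ and $K$, whereas an algorithm with storage linear in $N$ only keeps a number of quantities proportional to $N$; the last lines show that the bound's polynomial part grows more slowly than the horizon itself.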
