Regret Bounds for Stochastic Combinatorial Multi-Armed Bandits with Linear Space Complexity

Many real-world problems face the dilemma of choosing the K best out of N options at a given time instant. This setup can be modelled as a combinatorial bandit which chooses K out of N arms at each time, with the aim of achieving an efficient tradeoff between exploration and exploitation. This is the first work on combinatorial bandits where the reward received can be a non-linear function of the chosen K arms. The direct use of a multi-armed bandit requires choosing among N-choose-K options, making the state space large. In this paper, we present a novel algorithm which is computationally efficient and whose storage is linear in N. The proposed algorithm is a divide-and-conquer based strategy, which we call CMAB-SM. Further, the proposed algorithm achieves a regret bound for a time horizon T that is sub-linear in all parameters T, N, and K. The evaluation results on different reward functions and arm distribution functions show significantly improved performance as compared to the standard multi-armed bandit approach with N-choose-K choices.
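To make the storage argument concrete, the following minimal sketch contrasts the naive approach, which treats every K-subset as its own "super-arm" and therefore tracks N-choose-K statistics, with per-arm bookkeeping that stores only one counter and one running mean per base arm (linear in N). This is not the paper's CMAB-SM algorithm; the class and function names here are hypothetical illustrations of the space comparison only.

```python
import math

def naive_subset_count(n, k):
    # Treating each K-subset as one "super-arm" requires tracking
    # C(N, K) separate statistics -- combinatorially large storage.
    return math.comb(n, k)

class PerArmEstimator:
    """One running mean per base arm: O(N) storage.

    NOT the paper's CMAB-SM algorithm -- only a toy illustration
    of linear-space per-arm bookkeeping.
    """
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def update(self, arm, reward):
        # Incremental running-mean update for a single observed arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def top_k(self, k):
        # Greedily pick the K arms with the highest estimated means.
        return sorted(range(len(self.means)), key=lambda a: -self.means[a])[:k]

n, k = 20, 5
print(naive_subset_count(n, k))  # 15504 subsets vs. only 20 per-arm counters
```

Even at N = 20 and K = 5, the super-arm formulation already needs 15,504 counters, while per-arm storage stays at 20; the gap grows combinatorially with N and K.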