25
0

Adversarial Multi-dueling Bandits

Pratik Gajane
Abstract

We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select m2m \geq 2 arms at each round and observes as feedback the identity of the most preferred arm which is based on an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative TT-round regret of MiDEX compared to a Borda-winner from a set of KK arms is upper bounded by O((KlogK)1/3T2/3)O((K \log K)^{1/3} T^{2/3}). Moreover, we prove a lower bound of Ω(K1/3T2/3)\Omega(K^{1/3} T^{2/3}) for the expected regret in this setting which demonstrates that our proposed algorithm is near-optimal.

View on arXiv
Comments on this paper