arXiv:1808.04008

PAC Battling Bandits in the Plackett-Luce Model

12 August 2018
Aadirupa Saha
Aditya Gopalan
Abstract

We introduce the probably approximately correct (PAC) \emph{Battling-Bandit} problem with the Plackett-Luce (PL) subset choice model: an online learning framework where at each trial the learner chooses a subset of $k$ arms from a fixed set of $n$ arms, and subsequently observes stochastic feedback indicating preference information about the items in the chosen subset, e.g., the most preferred item or a ranking of the top $m$ most preferred items. The objective is to identify a near-best item in the underlying PL model with high confidence. This generalizes the well-studied PAC \emph{Dueling-Bandit} problem over $n$ arms, which aims to recover the \emph{best-arm} from pairwise preference information, and is known to require $O\left(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$ sample complexity \citep{Busa_pl,Busa_top}. We study the sample complexity of this problem under various feedback models: (1) Winner of the subset (WI), and (2) Ranking of top-$m$ items (TR) for $2 \le m \le k$. We show, surprisingly, that with winner information (WI) feedback over subsets of size $2 \le k \le n$, the best achievable sample complexity is still $O\left(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$, and the same as that in the Dueling Bandit setting ($k=2$). For the more general top-$m$ ranking (TR) feedback model, we show a significantly smaller lower bound on sample complexity of $\Omega\left(\frac{n}{m\epsilon^2} \ln \frac{1}{\delta}\right)$, which suggests a multiplicative reduction by a factor of $m$ owing to the additional information revealed from preferences among $m$ items instead of just $1$. We also propose two algorithms for the PAC problem with the TR feedback model with optimal (up to logarithmic factors) sample complexity guarantees, establishing the increase in statistical efficiency from exploiting rank-ordered feedback.
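
The two feedback models can be illustrated with a small simulation. The sketch below is not one of the paper's algorithms; it only samples feedback under the standard Plackett-Luce assumption that, within a subset $S$, item $i$ wins with probability $\theta_i / \sum_{j \in S} \theta_j$, and that a top-$m$ ranking is obtained by repeatedly drawing a winner from the items not yet ranked. The names `theta`, `pl_winner`, and `pl_top_m_ranking`, and the score values, are hypothetical and chosen only for illustration.

```python
import random

def pl_winner(theta, subset, rng):
    """Winner (WI) feedback: draw one item of `subset` with probability
    proportional to its PL score theta[i] (assumed positive)."""
    weights = [theta[i] for i in subset]
    return rng.choices(subset, weights=weights, k=1)[0]

def pl_top_m_ranking(theta, subset, m, rng):
    """Top-m ranking (TR) feedback: draw m winners in sequence,
    removing each winner from the pool before the next draw."""
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        winner = pl_winner(theta, remaining, rng)
        ranking.append(winner)
        remaining.remove(winner)
    return ranking

# Illustrative scores for n = 5 arms (assumed values, not from the paper).
theta = [1.0, 0.8, 0.6, 0.4, 0.2]
rng = random.Random(0)

subset = [0, 2, 3, 4]                                 # a chosen subset of k = 4 arms
winner = pl_winner(theta, subset, rng)                # WI feedback
ranking = pl_top_m_ranking(theta, subset, 3, rng)     # TR feedback with m = 3
```

A single TR observation with $m = 3$ contains three sequential winner draws, which gives some intuition for why the lower bound for TR feedback is smaller by a factor of $m$ than the bound for WI feedback.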
