An Asymptotically Optimal Batched Algorithm for the Dueling Bandit Problem

25 September 2022
Arpit Agarwal
R. Ghuge
V. Nagarajan
Abstract

We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons. Previous learning algorithms have focused on the fully adaptive setting, where the algorithm can make updates after every comparison. The "batched" dueling bandit problem is motivated by large-scale applications like web search ranking and recommendation systems, where performing sequential updates may be infeasible. In this work, we ask: is there a solution using only a few adaptive rounds that matches the asymptotic regret bounds of the best sequential algorithms for $K$-armed dueling bandits? We answer this in the affirmative under the Condorcet condition, a standard setting of the $K$-armed dueling bandit problem. We obtain asymptotic regret of $O(K^2 \log^2(K)) + O(K \log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon. Our regret bounds nearly match the best regret bounds known in the fully sequential setting under the Condorcet condition. Finally, in computational experiments over a variety of real-world datasets, we observe that our algorithm using $O(\log(T))$ rounds achieves almost the same performance as fully sequential algorithms (that use $T$ rounds).
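To make the setting concrete, here is a minimal, illustrative sketch (not the paper's algorithm) of the batched dueling bandit feedback model: arms are compared pairwise according to a preference matrix with a Condorcet winner, an entire batch of duels is executed before any feedback is observed, and empirical win rates are then used to identify the arm that beats all others. The preference matrix `P` and batch sizes below are invented for illustration.

```python
import random

# Hypothetical preference matrix: P[i][j] = probability that arm i beats arm j.
# Arm 0 is the Condorcet winner here: it beats every other arm w.p. > 1/2.
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]
K = len(P)

def duel(i, j, rng):
    """One pairwise comparison: True if arm i beats arm j."""
    return rng.random() < P[i][j]

def run_batch(pairs, rng):
    """Execute a whole batch of duels at once; the learner only sees
    the outcomes after the batch completes (the 'batched' setting)."""
    return [(i, j, duel(i, j, rng)) for (i, j) in pairs]

rng = random.Random(0)
# One batch: compare every ordered pair of distinct arms 200 times.
pairs = [(i, j) for i in range(K) for j in range(K) if i != j] * 200
results = run_batch(pairs, rng)

# Empirical win statistics, updated only once the batch returns.
wins = [[0] * K for _ in range(K)]
plays = [[0] * K for _ in range(K)]
for i, j, i_won in results:
    plays[i][j] += 1
    if i_won:
        wins[i][j] += 1

def empirical_condorcet_winner():
    """Return an arm that empirically beats all others in > 1/2 of duels."""
    for i in range(K):
        if all(wins[i][j] / plays[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None

print(empirical_condorcet_winner())
```

A fully sequential algorithm would instead adapt its choice of pairs after every single duel; the paper's contribution is matching that regime's regret while observing feedback in only $O(\log(T))$ such batches.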
