Observe Before Play: Multi-armed Bandit with Pre-observations

21 November 2019
Jinhang Zuo
Xiaoxi Zhang
Carlee Joe-Wong
Abstract

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for $K$ arms with Bernoulli rewards, and prove a $T$-round regret upper bound of $O(K^2 \log T)$. In the multi-player setting, collisions occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its $T$-round regret relative to an offline greedy strategy is upper bounded by $O(\frac{K^4}{M^2} \log T)$ for $K$ arms and $M$ players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
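To make the observe-before-play trade-off concrete, below is a minimal, hypothetical single-player sketch in the spirit of an observation-then-play UCB policy. It is not the paper's OBP-UCB algorithm: the observation-set size `n_obs`, the per-observation `cost`, and the Bernoulli `means` are illustrative assumptions, and the index and selection rules are standard UCB1 quantities rather than the authors' construction.

```python
import numpy as np

def observe_before_play_sketch(means, T=1000, n_obs=2, cost=0.05, seed=0):
    """Toy observe-before-play loop with Bernoulli arms (illustrative only).

    Each round: compute UCB1-style indices, pay `cost` to pre-observe the
    `n_obs` arms with the largest indices, then play the best observed arm
    and collect its realized reward for this round.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)   # times each arm's reward has been observed
    sums = np.zeros(K)     # cumulative observed reward per arm
    total = 0.0

    for t in range(1, T + 1):
        # UCB1 index: empirical mean plus exploration bonus; unseen arms first
        bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1.0))
        index = np.where(counts > 0, sums / np.maximum(counts, 1.0) + bonus, np.inf)

        # Pre-observe the n_obs arms with the highest indices, paying a cost each
        observed = np.argsort(index)[-n_obs:]
        rewards = rng.binomial(1, np.asarray(means)[observed]).astype(float)
        counts[observed] += 1
        sums[observed] += rewards

        # Play the best observed arm this round; net reward subtracts observation cost
        total += rewards.max() - cost * n_obs

    return total

# Example: 5 hypothetical Bernoulli arms, pre-observing 2 arms per round
print(observe_before_play_sketch([0.1, 0.3, 0.5, 0.7, 0.9]))
```

The sketch illustrates the dilemma described in the abstract: raising `n_obs` increases the chance that one of the observed arms has a high realized reward this round, but the observation cost grows linearly with the number of arms observed.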
