Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

13 November 2018
Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, Tor Lattimore
arXiv:1811.05154
Abstract

We propose a bandit algorithm that explores by randomizing its history of observations. The algorithm estimates the value of each arm from a non-parametric bootstrap sample of its history, which is augmented with pseudo observations. Our novel design of pseudo observations guarantees that the bootstrap estimates are optimistic with high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and prove an $O(K \Delta^{-1} \log n)$ bound on its $n$-round regret, where $K$ is the number of arms and $\Delta$ is the difference between the expected rewards of the optimal arm and the best suboptimal arm. The key advantage of our exploration design is that it can be easily applied to structured problems. To show this, we propose contextual Giro with an arbitrary non-linear reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that Giro performs well.
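The exploration mechanism the abstract describes can be sketched in a few lines. Below is a minimal Python sketch for the Bernoulli case, not the authors' implementation: it assumes that each pull appends the observed reward plus `a` pseudo observations of reward 0 and `a` of reward 1 to the pulled arm's history (with `a = 1` as an illustrative default), and that every round each arm's value is re-estimated as the mean of a bootstrap resample of its history. The function name `giro` and all parameter names are illustrative.

```python
import numpy as np

def giro(arm_means, n_rounds, a=1, seed=0):
    """Minimal sketch of Giro for a Bernoulli bandit (illustrative, not the authors' code)."""
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    histories = [[] for _ in range(K)]  # per-arm reward histories, real + pseudo
    best_mean = max(arm_means)
    regret = 0.0

    for _ in range(n_rounds):
        values = np.empty(K)
        for i, h in enumerate(histories):
            if not h:
                # Unplayed arms get an infinite value so each is tried once.
                values[i] = np.inf
            else:
                # Non-parametric bootstrap: resample the history with
                # replacement and use the sample mean as the value estimate.
                boot = rng.choice(h, size=len(h), replace=True)
                values[i] = boot.mean()
        arm = int(np.argmax(values))
        reward = float(rng.random() < arm_means[arm])
        # Augment the history with the real observation plus `a` pseudo
        # observations of 0 and `a` of 1; this augmentation is what keeps
        # the bootstrap estimates optimistic with high probability.
        histories[arm].extend([reward] + [0.0] * a + [1.0] * a)
        regret += best_mean - arm_means[arm]
    return regret
```

As a quick check of the sketch, `giro([0.2, 0.5, 0.8], n_rounds=1000)` should concentrate its pulls on the 0.8 arm and return a regret that grows only logarithmically with the horizon, matching the $O(K \Delta^{-1} \log n)$ bound stated above.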
