Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks

arXiv:2311.13180 · 22 November 2023

Jianqing Fan, Zhaoran Wang, Zhuoran Yang, Chenlu Ye
Abstract

We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into $L$ batches. Specifically, each batch collects data according to a policy that depends on the previous batches, and the rewards are revealed only at the end of the batch. Such a feedback structure is common in applications such as personalized medicine and online advertising, where online data often do not arrive in a fully serial manner. We consider high-dimensional linear settings where the reward function of the bandit model admits either a sparse or a low-rank structure, and ask how small the number of batches can be while achieving performance comparable to the fully sequential setting, in which $L = T$. For these settings, we design a provably sample-efficient algorithm that achieves $\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and $\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}(\log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in the sparse and low-rank cases, respectively, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm matches the regret bounds of the fully sequential setting with only $\mathcal{O}(\log T)$ batches. It features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and the cumulative regret. Finally, we conduct experiments with synthetic and real-world data to validate our theory.
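The batched-feedback structure can be illustrated with a minimal sketch: batch endpoints grow geometrically so that $L = \mathcal{O}(\log T)$ batches cover $T$ rounds, rewards are only used at batch boundaries, and a sparse (Lasso-type) estimate of the reward parameter is refit there. This is not the paper's algorithm — the greedy exploration, the ISTA-based Lasso solver, and the regularization choice below are simplified placeholders; only the batched feedback and the logarithmic batch count are taken from the abstract.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    b = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 1e-12)  # 1/Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b

def geometric_batch_grid(T, L):
    """Batch endpoints t_1 < ... < t_L = T on a geometric grid, so L = O(log T)."""
    a = T ** (1.0 / L)
    grid = sorted({min(T, int(np.ceil(a ** i))) for i in range(1, L + 1)})
    grid[-1] = T
    return grid

rng = np.random.default_rng(0)
d, s0, K, T = 30, 3, 5, 2000
theta = np.zeros(d)
theta[:s0] = 1.0                                # s0-sparse reward parameter

grid = geometric_batch_grid(T, L=int(np.ceil(np.log(T))))
beta_hat = np.zeros(d)
X_log, y_log, regret = [], [], 0.0
start = 0
for end in grid:
    for t in range(start, end):                 # within a batch: act on the frozen estimate
        contexts = rng.normal(size=(K, d))      # K candidate arms with fresh contexts
        a_t = int(np.argmax(contexts @ beta_hat))
        reward = contexts[a_t] @ theta + 0.1 * rng.normal()
        regret += np.max(contexts @ theta) - contexts[a_t] @ theta
        X_log.append(contexts[a_t])
        y_log.append(reward)
    # rewards are revealed only now: refit the sparse estimate at the batch boundary
    lam = 0.05 * np.sqrt(np.log(d) / len(y_log))
    beta_hat = lasso_ista(np.array(X_log), np.array(y_log), lam)
    start = end

print(len(grid), regret / T)
```

Because the grid is geometric, only about $\log T$ refits occur, yet most rounds fall in the later, long batches where the estimate is already accurate — which is why the average regret stays close to the fully sequential case.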
