We study high-dimensional multi-armed contextual bandits with batched feedback, where the steps of online interaction are divided into batches. Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch. Such a feedback structure is common in applications such as personalized medicine and online advertising, where the online data often do not arrive in a fully serial manner. We consider high-dimensional linear settings where the reward function of the bandit model admits either a sparse or a low-rank structure, and ask how few batches are needed to achieve performance comparable to that with fully dynamic data in which . For these settings, we design a provably sample-efficient algorithm which achieves a regret in the sparse case and regret in the low-rank case, using only batches. Here and are the sparsity and rank of the reward parameter in the sparse and low-rank cases, respectively, and omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in the fully sequential setting with only batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and the cumulative regret. Furthermore, we conduct experiments with synthetic and real-world data to validate our theory.