Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits

Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with $K$ arms and $T$ trials, a regret lower bound of $\Omega(T^{2/3})$ has been proved for any algorithm with $o(K)$ memory (Maiti et al. [NeurIPS'21]; Agarwal et al. [COLT'22]). On the other hand, the previous best regret upper bound is still $O(K^{1/3}T^{2/3}\log^{1/3}(T))$, which is achieved by the streaming implementation of simple uniform exploration. This gap leaves open the question of the tight regret bound for single-pass MABs with sublinear arm memory.

In this paper, we answer this open question and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3}T^{2/3})$ for algorithms with $o(K)$ memory, which matches the uniform exploration regret up to a logarithmic factor in $T$. We then show that the $\log^{1/3}(T)$ factor is not necessary: $O(K^{1/3}T^{2/3})$ regret can be achieved by finding an $\varepsilon$-best arm and committing to it for the remaining trials. For regret minimization with high constant probability, we can apply the single-arm-memory $\varepsilon$-best arm algorithms of Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for expected regret minimization, we design an algorithm with a single-arm memory that achieves $O(K^{1/3}T^{2/3}\log(K))$ regret, and an algorithm with $O(\log^{*}(K))$ memory that achieves the optimal $O(K^{1/3}T^{2/3})$ regret, following the $\varepsilon$-best arm algorithm of Assadi and Wang [STOC'20].

We further test the empirical performance of our algorithms. The simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin, and on occasion reduce the regret by up to 70%.
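For context, the following is a minimal sketch (not from the paper) of the single-pass uniform-exploration baseline the abstract refers to: arms arrive as a stream, each is pulled a fixed number of times, only the empirically best arm seen so far is retained in memory, and the remaining horizon is committed to that arm. The function name, the Bernoulli reward model, and the exploration budget constant are illustrative assumptions rather than the paper's algorithm.

```python
import math
import random


def uniform_exploration_stream(arm_means, T, seed=0):
    """Single-pass uniform exploration (explore-then-commit), a minimal sketch.

    Arms arrive as a stream of Bernoulli means; each arm is pulled N times on
    arrival, and only the arm with the best empirical mean so far is kept in
    memory (single-arm memory). The remaining trials commit to the stored arm.
    """
    rng = random.Random(seed)
    K = len(arm_means)
    # Per-arm exploration budget ~ (T/K)^{2/3} * log^{1/3}(T), which balances
    # the K*N exploration cost against the error of committing to a wrong arm.
    N = max(1, int((T / K) ** (2 / 3) * math.log(T) ** (1 / 3)))

    pulls, total_reward = 0, 0.0
    best_mean_hat, best_arm = -1.0, arm_means[0]

    for mu in arm_means:                      # arms arrive one by one
        if pulls + N > T:                     # exploration budget exhausted
            break
        wins = sum(rng.random() < mu for _ in range(N))  # N Bernoulli pulls
        pulls += N
        total_reward += wins
        if wins / N > best_mean_hat:          # keep only the best arm so far
            best_mean_hat, best_arm = wins / N, mu

    # Commit to the stored arm for all remaining trials.
    total_reward += sum(rng.random() < best_arm for _ in range(T - pulls))
    return T * max(arm_means) - total_reward  # realized regret


if __name__ == "__main__":
    # 99 arms with mean 0.5 and one with mean 0.7, over T = 100,000 trials.
    print(uniform_exploration_stream([0.5] * 99 + [0.7], T=100_000))
```

The algorithms described in the abstract improve on this baseline by replacing the fixed per-arm budget with a streaming $\varepsilon$-best arm identification phase before committing.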