Smooth Non-Stationary Bandits

International Conference on Machine Learning (ICML), 2023
Main: 21 pages, 8 figures; Bibliography: 3 pages; Appendix: 16 pages
Abstract

In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, where they guarantee $T^{2/3}$ regret. However, in practice environments are often changing {\it smoothly}, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the {\it rate of change}. In this paper, we study a non-stationary two-arm bandit problem where we assume an arm's mean reward is a $\beta$-Hölder function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first {\it separation} between the smooth and non-smooth regimes by presenting a policy with $T^{3/5}$ regret for $\beta=2$. We complement this result with a $T^{\frac{\beta+1}{2\beta+1}}$ lower bound for any integer $\beta\ge 1$, which matches our upper bound for $\beta=2$.
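For concreteness, the smoothness assumption described in words above can be written out as the standard Hölder-class condition for integer $\beta$; the constant $L$ below is an assumption of this sketch, not named in the abstract:

```latex
% \beta-Hölder smoothness of an arm's mean-reward function f over
% normalized time [0,1], for integer \beta \ge 1:
% f is (\beta-1)-times continuously differentiable and its
% (\beta-1)-th derivative is L-Lipschitz, i.e.
\left| f^{(\beta-1)}(x) - f^{(\beta-1)}(y) \right| \le L \, |x - y|
\quad \text{for all } x, y \in [0,1].
```

For $\beta = 1$ this reduces to $f$ itself being $L$-Lipschitz in time, the non-smooth regime with $T^{2/3}$ regret; for $\beta = 2$ the derivative of $f$ is Lipschitz, the regime where the abstract's $T^{3/5}$ upper bound applies.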
