In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, where they guarantee $\widetilde{\Theta}(T^{2/3})$ regret. However, in practice environments are often changing {\bf smoothly}, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. We study a non-stationary two-armed bandits problem where we assume that an arm's mean reward is a $\beta$-H\"older function over (normalized) time, meaning it is $(\beta-1)$-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with $\widetilde{O}(T^{3/5})$ regret for $\beta=2$. We complement this result by an $\Omega(T^{(\beta+1)/(2\beta+1)})$ lower bound for any integer $\beta$, which matches our upper bound for $\beta=2$.
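As a quick check of the exponent arithmetic implied by the rates stated above (an illustration, not an additional result): evaluating the lower-bound exponent $(\beta+1)/(2\beta+1)$ at $\beta=2$ gives the claimed separation from the non-smooth rate, while $\beta=1$ recovers it,
\[
  \left.\frac{\beta+1}{2\beta+1}\right|_{\beta=2} = \frac{3}{5} < \frac{2}{3},
  \qquad
  \left.\frac{\beta+1}{2\beta+1}\right|_{\beta=1} = \frac{2}{3}.
\]
Thus, under these rates, the benefit of smoothness appears only once $\beta \ge 2$; at $\beta=1$ (plain Lipschitzness) the bound coincides with the classical non-smooth $T^{2/3}$ rate.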