This paper investigates the problem of non-stationary linear bandits, where the unknown regression parameter is evolving over time. Existing studies develop various algorithms and show that they enjoy an $\widetilde{O}(T^{2/3}(1+P_T)^{1/3})$ dynamic regret, where $T$ is the time horizon and $P_T$ is the path-length that measures the fluctuation of the evolving unknown parameter. In this paper, we discover that a serious technical flaw makes their results ungrounded, and then present a fix, which gives an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret without modifying the original algorithms. Furthermore, we demonstrate that instead of using sophisticated mechanisms, such as sliding windows or weighted penalties, a simple restarted strategy is sufficient to attain the same regret guarantee. Specifically, we design a UCB-type algorithm to balance exploitation and exploration, and restart it periodically to handle the drift of the unknown parameter. Our approach enjoys an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret. Note that to achieve this bound, the algorithm requires oracle knowledge of the path-length $P_T$. By treating our algorithm as the base learner and combining it with the bandits-over-bandits mechanism, we can further achieve the same regret bound in a parameter-free way. Empirical studies also validate the effectiveness of our approach.
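To make the restarted strategy concrete, below is a minimal sketch (not the authors' implementation) of a restarted LinUCB-style learner: a standard UCB-type linear bandit algorithm whose statistics are reset every $H$ rounds so that stale data from a drifted parameter is discarded. The epoch length `H`, confidence width `beta`, and the toy drifting environment are illustrative assumptions; in the paper the restart schedule would be tuned using (oracle or bandits-over-bandits-estimated) knowledge of the path-length.

```python
# Sketch of a periodically restarted LinUCB-style algorithm for drifting parameters.
# All names (restarted_linucb, arms_fn, reward_fn, H, beta) are illustrative, not from the paper.
import numpy as np

def restarted_linucb(T, d, arms_fn, reward_fn, H=200, lam=1.0, beta=1.0):
    """Run a UCB-type linear bandit learner and reset its statistics every H rounds."""
    A = lam * np.eye(d)          # regularized Gram matrix of the current epoch
    b = np.zeros(d)              # running sum of reward-weighted features
    choices = []
    for t in range(T):
        if t % H == 0:           # periodic restart to cope with parameter drift
            A = lam * np.eye(d)
            b = np.zeros(d)
        theta_hat = np.linalg.solve(A, b)          # ridge estimate from current epoch only
        A_inv = np.linalg.inv(A)
        arms = arms_fn(t)                          # candidate feature vectors, shape (k, d)
        # optimistic index: estimated reward plus exploration bonus ||x||_{A^{-1}}
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        ucb = arms @ theta_hat + beta * bonus
        x = arms[np.argmax(ucb)]                   # play the optimistic arm
        r = reward_fn(t, x)                        # observe a noisy reward
        A += np.outer(x, x)
        b += r * x
        choices.append(x)
    return choices

# Toy usage with a slowly drifting regression parameter (illustrative only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, T = 5, 10, 1000
    def arms_fn(t):
        v = rng.normal(size=(k, d))
        return v / np.linalg.norm(v, axis=1, keepdims=True)
    def reward_fn(t, x):
        theta_t = np.ones(d) / np.sqrt(d) * np.cos(2 * np.pi * t / T)  # drifting parameter
        return float(x @ theta_t + 0.1 * rng.normal())
    restarted_linucb(T, d, arms_fn, reward_fn, H=150)
```

The design choice illustrated here is that restarting simply forgets old interactions in blocks, whereas sliding-window or weighted-penalty mechanisms forget them gradually; the abstract's point is that this cruder forgetting already attains the same order of dynamic regret.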