Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

Abstract

Multi-armed bandit (MAB) is a class of online learning problems in which a learning agent aims to maximize its expected cumulative reward while repeatedly pulling arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component into classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm achieves a nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in numerical experiments on a public Yahoo! dataset to demonstrate its superior performance.
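The abstract does not spell out the algorithmic details, so the following is only a minimal Python sketch of the general recipe it describes: a UCB rule whose statistics are reset whenever a simple sliding-window mean-shift test flags a distribution change. The names `m_ucb_sketch`, `change_detected`, and `pull`, and the parameters `w` (window size), `b` (detection threshold), and `gamma` (forced-exploration fraction) are illustrative assumptions, not the paper's actual interface or tuned values.

```python
import math
import random
from collections import deque

def change_detected(window, threshold):
    """Simple mean-shift test (a sketch; the paper's exact test may differ):
    compare the means of the first and second halves of a full sample window
    and flag a change if they differ by more than `threshold`."""
    half = len(window) // 2
    first, second = list(window)[:half], list(window)[half:]
    return abs(sum(first) / half - sum(second) / half) > threshold

def m_ucb_sketch(pull, K, T, w=100, b=0.5, gamma=0.05):
    """Hedged sketch of a UCB rule with per-arm change detection.
    `pull(arm, t)` returns a reward in [0, 1]; w, b, gamma are illustrative
    (window size, detection threshold, forced-exploration fraction)."""
    counts = [0] * K
    sums = [0.0] * K
    windows = [deque(maxlen=w) for _ in range(K)]
    tau = 0  # time of the last detected change (restart point)
    total = 0.0
    for t in range(T):
        n = t - tau  # steps since the last restart
        if n < K:
            arm = n  # initialize: pull each arm once after a restart
        elif random.random() < gamma:
            arm = random.randrange(K)  # forced exploration keeps all windows fresh
        else:
            # classic UCB1 index computed on post-restart statistics only
            arm = max(range(K), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(n) / counts[a]))
        r = pull(arm, t)
        total += r
        counts[arm] += 1
        sums[arm] += r
        windows[arm].append(r)
        # run the change-detection test on the chosen arm's window
        if len(windows[arm]) == w and change_detected(windows[arm], b):
            tau = t + 1  # restart: forget all pre-change statistics
            counts = [0] * K
            sums = [0.0] * K
            for wd in windows:
                wd.clear()
    return total
```

The key design point, per the abstract, is that restarting after each detected change lets the learner treat each of the $M$ stationary segments as a fresh bandit instance; to try the sketch, pass a `pull` function that draws, say, Bernoulli rewards whose means shift at unknown breakpoints.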
