Cascading bandits model the task of learning to rank $K$ out of $L$ items over $n$ rounds of partial feedback. For this model, the minimax (i.e., gap-free) regret is poorly understood; in particular, the best known lower and upper bounds are $\Omega(\sqrt{nL/K})$ and $\tilde{O}(\sqrt{nLK})$, respectively. We improve the lower bound to $\Omega(\sqrt{nL})$ and show CascadeKL-UCB (which ranks items by their KL-UCB indices) attains it up to log terms. Surprisingly, we also show CascadeUCB1 (which ranks via UCB1) can suffer suboptimal regret. This sharply contrasts with standard $L$-armed bandits, where the corresponding algorithms both achieve the minimax regret $\sqrt{nL}$ (up to log terms), and the main advantage of KL-UCB is only to improve constants in the gap-dependent bounds. In essence, this contrast occurs because Pinsker's inequality is tight for hard problems in the $L$-armed case but loose (by a factor of $K$) in the cascading case.
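To make the contrast concrete, here is a minimal Python sketch of the two index rules the abstract refers to: a KL-UCB index computed by bisection on the Bernoulli KL divergence, and a UCB1-style index with a Hoeffding-type bonus, each plugged into one round of cascade feedback. The exploration constant, the bisection routine, and all helper names (`kl_ucb_index`, `ucb1_index`, `cascade_round`, `click_prob`) are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t):
    """KL-UCB index: largest q >= mean with pulls * kl(mean, q) <= log t (bisection)."""
    if pulls == 0:
        return 1.0
    target = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

def ucb1_index(mean, pulls, t):
    """UCB1-style index: empirical mean plus a Hoeffding-type bonus (constant is a placeholder)."""
    if pulls == 0:
        return float("inf")
    return mean + math.sqrt(1.5 * math.log(max(t, 2)) / pulls)

def cascade_round(index_fn, means, pulls, t, L, K, click_prob, rng):
    """One round: rank the top K items by index, then observe cascade feedback."""
    order = sorted(range(L), key=lambda i: index_fn(means[i], pulls[i], t), reverse=True)
    for item in order[:K]:                    # user scans the list top-down
        clicked = rng.random() < click_prob[item]
        pulls[item] += 1                      # this item was examined
        means[item] += (clicked - means[item]) / pulls[item]
        if clicked:                           # cascade stops at the first click
            break

# Example run with synthetic small click probabilities (the hard regime).
rng = random.Random(0)
L, K, n = 20, 4, 10_000
click_prob = [rng.uniform(0.01, 0.2) for _ in range(L)]
means, pulls = [0.0] * L, [0] * L
for t in range(1, n + 1):
    cascade_round(kl_ucb_index, means, pulls, t, L, K, click_prob, rng)
```

Informally, the KL-UCB index adapts to the small click probabilities that make cascading instances hard (its confidence width shrinks with the Bernoulli variance), whereas the fixed Hoeffding-type bonus of UCB1 does not, which is consistent with the abstract's claim that the UCB1-based ranking can be order-wise suboptimal in this setting.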