arXiv:1310.2997
Bandits with Switching Costs: T^{2/3} Regret

11 October 2013
O. Dekel, Jian Ding, Tomer Koren, Yuval Peres
Abstract

We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's $T$-round minimax regret in this setting is $\widetilde{\Theta}(T^{2/3})$, thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of $\Theta(\sqrt{T})$. The difference between these two rates provides the \emph{first} indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on $T$). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of $\widetilde{\Theta}(T^{2/3})$. Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is $\widetilde{\Theta}(T^{2/3})$. The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
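
To make the bounded quantity concrete: with unit switching costs, the regret adds a movement term to the usual bandit regret. The following is the standard form of this definition; the notation ($x_t$ for the player's action, $\ell_t$ for the adversary's loss vector, $k$ actions) is ours rather than quoted from the paper.

    % Regret after T rounds with k actions and a unit cost per switch
    \mathrm{Regret}_T
      = \sum_{t=1}^{T} \ell_t(x_t)
      + \sum_{t=2}^{T} \mathbb{1}\{ x_t \neq x_{t-1} \}
      - \min_{x \in \{1, \dots, k\}} \sum_{t=1}^{T} \ell_t(x)

The theorem pins the minimax value of this quantity at $\widetilde{\Theta}(T^{2/3})$ under bandit feedback, against $\Theta(\sqrt{T})$ with full information.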

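The multi-scale random walk behind the lower bound is simple to simulate. Below is a minimal Python sketch, assuming the construction described in the paper: losses follow a Gaussian walk $W_t = W_{\rho(t)} + \xi_t$ whose increment at round $t$ links back to the ancestor $\rho(t) = t - 2^{\delta(t)}$, with $\delta(t)$ the exponent of the largest power of two dividing $t$, while a hidden better action enjoys a small per-round advantage $\epsilon$ on the order of $T^{-1/3}$. The function names, the value of $\sigma$, and the unclipped losses are illustrative choices of ours, not the paper's exact parameters.

    import numpy as np

    def delta(t: int) -> int:
        """Exponent of the largest power of 2 dividing t (t >= 1)."""
        return (t & -t).bit_length() - 1

    def multiscale_walk(T: int, sigma: float, rng: np.random.Generator) -> np.ndarray:
        """Sample W_0..W_T with W_t = W_{rho(t)} + xi_t, rho(t) = t - 2^{delta(t)}.

        The dyadic links mean each W_t is a sum of only O(log T) Gaussian
        increments, which is what lets the adversary hide a small gap from
        a learner that cannot afford to switch often.
        """
        W = np.zeros(T + 1)
        xi = rng.normal(0.0, sigma, size=T + 1)
        for t in range(1, T + 1):
            W[t] = W[t - (1 << delta(t))] + xi[t]  # W_{rho(t)} + xi_t
        return W

    # Two-action instance: both actions track the walk; the hidden better
    # action chi gets a constant advantage eps ~ T^{-1/3}, the scale that
    # forces Theta-tilde(T^{2/3}) regret against switch-averse learners.
    rng = np.random.default_rng(0)
    T = 1024
    sigma = 1.0 / np.log2(T)   # illustrative noise scale, not the paper's
    eps = T ** (-1.0 / 3.0)
    chi = rng.integers(2)      # hidden better action
    W = multiscale_walk(T, sigma, rng)
    losses = np.stack([W[1:] - eps * (x == chi) for x in (0, 1)], axis=1)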