Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs

5 November 2021
Yeoneung Kim
Insoon Yang
Kwang-Sung Jun
arXiv:2111.03289 · PDF · HTML
Abstract

In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K},\, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$, where $d$ is the dimension of the features, $K$ is the time horizon, $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d\sqrt{K} + d^2)$, where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
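
To make the variance-adaptive setting concrete, the sketch below illustrates the general recipe behind algorithms of this kind (variance-weighted ridge regression with an optimistic, elliptical-confidence-set action rule); it is a minimal illustration, not the paper's exact algorithm or analysis. The helper names (`weighted_ridge_update`, `ucb_action`, `sigma2_hat`) and all constants are assumptions made for the example.

```python
import numpy as np

def weighted_ridge_update(Sigma, b, x, reward, sigma2_hat, min_var=1e-2):
    """One step of variance-weighted ridge regression (illustrative sketch).

    Each sample (x, reward) is weighted by 1 / max(sigma2_hat, min_var),
    so low-variance observations tighten the estimate faster.
    """
    w = 1.0 / max(sigma2_hat, min_var)      # per-sample weight from the (estimated) variance
    Sigma = Sigma + w * np.outer(x, x)      # weighted Gram matrix
    b = b + w * reward * x                  # weighted response vector
    theta_hat = np.linalg.solve(Sigma, b)   # regularized least-squares estimate
    return Sigma, b, theta_hat

def ucb_action(arms, theta_hat, Sigma, beta):
    """Pick the arm maximizing the optimistic value
    <x, theta_hat> + beta * ||x||_{Sigma^{-1}} (standard elliptical bonus)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [x @ theta_hat + beta * np.sqrt(x @ Sigma_inv @ x) for x in arms]
    return arms[int(np.argmax(scores))]

# Minimal usage example with synthetic data (feature dimension d = 3).
d = 3
rng = np.random.default_rng(0)
Sigma, b = np.eye(d), np.zeros(d)           # lambda = 1 ridge regularization
theta_star = rng.normal(size=d)             # unknown true parameter
theta_hat = np.zeros(d)
for k in range(100):
    arms = [rng.normal(size=d) for _ in range(5)]
    x = ucb_action(arms, theta_hat, Sigma, beta=1.0)
    sigma_k = 0.1 * rng.uniform()           # time-varying noise level
    reward = x @ theta_star + sigma_k * rng.normal()
    # In practice sigma_k^2 would itself be estimated; the true value is used
    # here only to keep the example short.
    Sigma, b, theta_hat = weighted_ridge_update(Sigma, b, x, reward, sigma_k**2)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```

The key design point the sketch conveys is that rounds with small noise variance receive large weight, so the confidence ellipsoid shrinks at a rate governed by the total variance rather than the raw horizon.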

View on arXiv