Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

9 February 2021
Haike Xu
Tengyu Ma
S. Du
Abstract

This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among the admissible actions that are not dominated by any other action. We show that when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that scales only with the sum of the inverses of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed that all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{\min}}\right)$ regret due to over-exploration, where $\Delta_{\min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{\min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing that the dependency on $\frac{|Z_{mul}|}{\Delta_{\min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.
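To make the action-selection rule concrete, here is a minimal Python sketch of one plausible reading of it: an action is treated as dominated when its upper confidence bound falls below some other action's lower bound, and among the remaining admissible actions the agent plays the one with the widest confidence interval. The function name `select_action`, the arrays `q_lower`/`q_upper`, and the argmax tie-breaking are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def select_action(q_lower: np.ndarray, q_upper: np.ndarray) -> int:
    """Illustrative sketch of AMB-style action selection for one state.

    q_lower, q_upper: per-action lower/upper confidence bounds on Q(s, a).
    (Hypothetical interface; the paper's construction differs in detail.)
    """
    # An action is dominated if some other action's lower bound exceeds
    # its upper bound, so the non-dominated (admissible) actions are those
    # whose upper bound reaches the best lower bound.
    best_lower = q_lower.max()
    admissible = np.flatnonzero(q_upper >= best_lower)
    # Among admissible actions, play the one with the largest confidence
    # interval length, i.e. the action the agent is most uncertain about.
    widths = q_upper[admissible] - q_lower[admissible]
    return int(admissible[np.argmax(widths)])

# Example: action 2 is dominated (0.4 < 0.5); among actions 0 and 1,
# action 1 has the wider interval (0.5 vs 0.4) and is selected.
q_lower = np.array([0.2, 0.5, 0.1])
q_upper = np.array([0.6, 1.0, 0.4])
print(select_action(q_lower, q_upper))  # -> 1
```

Note the contrast with a standard UCB rule, which would simply play the action maximizing `q_upper`: restricting attention to non-dominated actions before exploring is the behavior that the abstract credits, together with the adaptive multi-step bootstrap, for avoiding the $\Omega\left(\frac{S}{\Delta_{\min}}\right)$ over-exploration term incurred by all UCB algorithms.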
