Settling the Sample Complexity of Online Reinforcement Learning

25 July 2023
Zihan Zhang
Yuxin Chen
Jason D. Lee
Simon S. Du
Abstract

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing an enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors)
\begin{equation*}
\min\big\{ \sqrt{SAH^3K}, \, HK \big\},
\end{equation*}
where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample sizes $K \geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influence of problem-dependent quantities such as the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency, a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
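For intuition only, the following minimal sketch (in Python, not part of the paper) illustrates how the stated regret rate min{ sqrt(SAH^3 K), HK } relates to the quoted PAC sample complexity SAH^3/eps^2: requiring the average per-episode regret sqrt(SAH^3/K) to be at most eps and solving for K. Constants and log factors are ignored, and the function names and example numbers are illustrative assumptions.

import math

def minimax_regret_rate(S, A, H, K):
    # Regret rate stated in the abstract, up to constants and log factors:
    # min{ sqrt(S*A*H^3*K), H*K }.
    return min(math.sqrt(S * A * H**3 * K), H * K)

def pac_episode_count(S, A, H, eps):
    # Require average per-episode regret sqrt(S*A*H^3 / K) <= eps,
    # which gives K >= S*A*H^3 / eps^2 (modulo log factors).
    return S * A * H**3 / eps**2

# Illustrative (arbitrary) problem sizes.
S, A, H = 10, 5, 20
for K in (1, 100, 10_000):
    print(f"K={K}: regret bound ~ {minimax_regret_rate(S, A, H, K):.1f}")
print(f"episodes for eps=0.1 accuracy ~ {pac_episode_count(S, A, H, 0.1):.0f}")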

View on arXiv
@article{zhang2025_2307.13586,
  title={Settling the Sample Complexity of Online Reinforcement Learning},
  author={Zihan Zhang and Yuxin Chen and Jason D. Lee and Simon S. Du},
  journal={arXiv preprint arXiv:2307.13586},
  year={2025}
}