A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle

22 March 2022
Ziniu Li
Tian Xu
Yang Yu
Abstract

Q-learning with function approximation can diverge in the off-policy setting, and the target network is a powerful technique to address this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$. Furthermore, we show that this sample complexity improves to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-5}\varepsilon^{-2})$ if we can sequentially update all state-action pairs, and to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-4}\varepsilon^{-2})$ if, in addition, $\gamma \in (1/2, 1)$. Compared with vanilla Q-learning, our results show that introducing a periodically-frozen target Q-function does not sacrifice sample complexity.
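For orientation, the setting described in the abstract is tabular target Q-learning driven by a generative oracle, with a target Q-table that is only refreshed periodically. The sketch below illustrates that structure under simple assumptions: the helper names (sample_next_state, reward), the step size, and the iteration counts are illustrative placeholders, not the algorithm or constants analyzed in the paper.

```python
import numpy as np

def target_q_learning(sample_next_state, reward, num_states, num_actions,
                      gamma=0.99, lr=0.1, inner_steps=100, outer_iters=50):
    """Minimal sketch of tabular target Q-learning with a generative oracle.

    sample_next_state(s, a) -> next state drawn from P(. | s, a)  (generative oracle)
    reward(s, a)            -> scalar reward for the pair (s, a)
    All names and constants here are illustrative assumptions.
    """
    Q = np.zeros((num_states, num_actions))   # online Q-table being updated
    Q_target = Q.copy()                       # periodically-frozen target Q-table

    for _ in range(outer_iters):
        for _ in range(inner_steps):
            # Sweep over all state-action pairs, querying the generative
            # oracle once per pair (the "update all state-action pairs"
            # regime mentioned in the abstract).
            for s in range(num_states):
                for a in range(num_actions):
                    s_next = sample_next_state(s, a)
                    # Bootstrap against the frozen target table, not Q itself.
                    td_target = reward(s, a) + gamma * Q_target[s_next].max()
                    Q[s, a] += lr * (td_target - Q[s, a])
        # Refresh the target: copy the online table into the frozen one.
        Q_target = Q.copy()

    return Q
```

The frozen Q_target in the bootstrap step is what distinguishes target Q-learning from vanilla Q-learning, where the TD target would use the current Q directly; the abstract's conclusion is that this periodic freezing does not worsen the sample complexity.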
