Adaptive Reward-Free Exploration

11 June 2020
E. Kaufmann
Pierre Ménard
O. D. Domingues
Anders Jonsson
Edouard Leurent
Michal Valko
arXiv:2006.06294
Abstract

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $(SAH^4/\varepsilon^2)(\log(1/\delta) + S)$ episodes to output, with probability $1-\delta$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $\delta$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
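
As a concrete illustration of the idea described in the abstract, the sketch below mimics an RF-UCRL-style loop on a toy tabular MDP: it maintains upper bounds on the estimation error, explores greedily to shrink them, and stops once the bound at the initial state is small. It is only an assumption-laden approximation, not the paper's algorithm: the random toy MDP, the bonus of the form H·sqrt(beta/n), the crude confidence term beta, the fixed initial state, and the epsilon/2 stopping rule are simplified stand-ins for the quantities defined in the paper.

```python
# Illustrative sketch only (assumed, not the paper's exact algorithm or constants):
# a reward-free, UCRL-style exploration loop on a toy tabular MDP. It keeps
# transition counts, propagates an upper bound E[h, s, a] on the estimation error
# backwards over the horizon, explores greedily with respect to that bound, and
# stops once the bound at the initial state drops below epsilon / 2.

import numpy as np

S, A, H = 4, 2, 3              # toy sizes: states, actions, horizon
epsilon, delta = 1.0, 0.1      # loose accuracy/confidence so the demo finishes in seconds
rng = np.random.default_rng(0)

# Random toy transition kernel, used only to simulate episodes (an assumption).
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # P_true[s, a] = distribution over next states

counts = np.zeros((S, A, S))                      # transition counts n(s, a, s')


def error_bounds(counts):
    """Backward recursion for error upper bounds E[h, s, a].

    Uses a simplified bonus of the form H * sqrt(beta / n); the paper's bound
    has the same flavour but different constants and logarithmic terms.
    """
    n = counts.sum(axis=2)                         # visit counts n(s, a)
    beta = np.log(2 * S * A * H / delta) + S       # crude confidence term (assumption)
    bonus = H * np.sqrt(beta / np.maximum(n, 1))
    bonus[n == 0] = H                              # unvisited pairs keep the trivial bound
    p_hat = counts / np.maximum(n, 1)[:, :, None]  # empirical transition kernel
    E = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):                 # propagate from the last step backwards
        next_err = E[h + 1].max(axis=1)            # max over actions at the next step
        E[h] = np.minimum(H, bonus + p_hat @ next_err)
    return E


episode = 0
s0 = 0                                             # fixed initial state (assumption)
while episode < 100_000:                           # safety cap on the number of episodes
    E = error_bounds(counts)
    if E[0, s0].max() <= epsilon / 2:              # stop when the error bound is small enough
        break
    s = s0
    for h in range(H):                             # one exploration episode, greedy w.r.t. E
        a = int(E[h, s].argmax())
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next
    episode += 1

print(f"stopped after {episode} episodes; error bound at s0: {E[0, s0].max():.3f}")
```

Once the loop stops, the collected transition counts define an empirical model from which a near-optimal policy can be computed for any reward function supplied afterwards, which is the reward-free guarantee that the sample-complexity bound above quantifies.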
