From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

16 May 2022
D. Tiapkin
Denis Belomestny
Eric Moulines
A. Naumov
S. Samsonov
Yunhao Tang
Michal Valko
Pierre Menard
ArXiv / PDF / HTML
Abstract

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3 S A T})$, where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, and $T$ the number of episodes, which matches the lower bound of $\Omega(\sqrt{H^3 S A T})$ up to poly-$\log$ terms in $H, S, A, T$ for large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that may be of independent interest. We then explain how Bayes-UCBVI can easily be extended beyond the tabular setting, exhibiting a strong link between our algorithm and the Bayesian bootstrap (Rubin, 1981).
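
To make the quantile-of-the-posterior idea concrete, here is a minimal illustrative sketch, not the paper's exact algorithm: for a single (state, action) pair, a Dirichlet posterior over next-state probabilities is formed from visit counts, and a high quantile of the induced expected next-state value is used as an optimistic Q-value, with no explicit exploration bonus. The function name optimistic_q_estimate and the parameters counts, next_values, reward, quantile, and prior are hypothetical names introduced only for this example.

import numpy as np

def optimistic_q_estimate(counts, next_values, reward, quantile=0.9,
                          n_samples=1000, prior=1.0, rng=None):
    # Illustrative sketch only: optimistic Q-value for one (state, action)
    # pair via a high quantile of the posterior over the expected
    # next-state value. All argument names are hypothetical.
    rng = np.random.default_rng() if rng is None else rng
    # Dirichlet posterior over next-state probabilities, given visit counts
    alpha = np.asarray(counts, dtype=float) + prior
    probs = rng.dirichlet(alpha, size=n_samples)      # shape (n_samples, S)
    # Each posterior sample yields one plausible expected next-state value
    sampled_means = probs @ np.asarray(next_values, dtype=float)
    # A high quantile of this posterior plays the role of an upper
    # confidence bound on the Q-value, instead of an explicit bonus
    return reward + np.quantile(sampled_means, quantile)

# Toy usage: 3 possible next states with values from a previous backup
print(optimistic_q_estimate(counts=[4, 1, 0],
                            next_values=[1.0, 0.5, 2.0],
                            reward=0.3))

As described in the abstract, Bayes-UCBVI uses a quantile of the Q-value posterior directly; the Monte-Carlo sampling above is only one simple way to approximate such a quantile, and it also hints at the Bayesian-bootstrap connection mentioned at the end of the abstract.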

View on arXiv: 2205.07704