arXiv:2009.04575
Improved Exploration in Factored Average-Reward MDPs

9 September 2020
M. S. Talebi
Anders Jonsson
Odalric-Ambrym Maillard
Abstract

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state space $\mathcal S$ admit the respective factored forms $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S = \otimes_{i=1}^m \mathcal S_i$, and the transition and reward functions are factored over $\mathcal X$ and $\mathcal S$. Assuming a known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on the sizes of the $\mathcal S_i$'s and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of the regrets of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL empirically enjoys substantially improved regret over existing algorithms with frequentist regret guarantees.
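The Bernstein-type confidence sets mentioned above can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the paper's exact bound: the constants, the form of the logarithmic term, and the helper names (`bernstein_radius`, `factored_radii`) are ours; DBN-UCRL's actual confidence sets are defined in the paper.

```python
import numpy as np

def bernstein_radius(p_hat: float, n: int, delta: float) -> float:
    """Empirical-Bernstein-style radius so that |p - p_hat| <= radius
    holds with probability at least 1 - delta (illustrative constants)."""
    if n == 0:
        return 1.0  # no samples yet: the trivial radius covers all of [0, 1]
    log_term = np.log(1.0 / delta)
    # Variance-sensitive main term plus a lower-order 1/n correction,
    # as in Bernstein's inequality for a Bernoulli mean estimate.
    return float(np.sqrt(2.0 * p_hat * (1.0 - p_hat) * log_term / n)
                 + 3.0 * log_term / n)

def factored_radii(p_hats: list, counts: list, delta: float) -> list:
    """In an FMDP, the bound is applied per element of the transition
    function: each next-state factor depends only on a small scope of the
    state-action, so confidence sets live on these smaller estimates."""
    return [bernstein_radius(p, n, delta) for p, n in zip(p_hats, counts)]
```

The variance-sensitive main term is what makes such bounds tighter than Hoeffding-style radii when the estimated probability is near 0 or 1, which is one driver of the improved dependence on the sizes of the $\mathcal S_i$'s.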
