Near-optimal Reinforcement Learning in Factored MDPs

15 March 2014
Ian Osband
Benjamin Van Roy
arXiv:1403.3741 (abs) · PDF · HTML
Abstract

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
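One way to read the second sentence of the abstract: if cumulative regret after $T$ steps is at least $c\sqrt{SAT}$, then the average per-step regret is at least $c\sqrt{SA/T}$, which falls below a target $\epsilon$ only once $T \ge c^2 SA / \epsilon^2 = \Omega(SA)$ for any fixed $\epsilon$.

Below is a minimal Python/NumPy sketch of posterior sampling for reinforcement learning (PSRL) on an ordinary tabular MDP, one of the two algorithms named in the abstract. It is an illustration, not the paper's factored algorithm: the Dirichlet and Gaussian posteriors, the discounted value iteration, and all names here are simplifying assumptions chosen for brevity; the factored variant analyzed in the paper instead maintains a posterior for each factor of the transition and reward models.

import numpy as np

def psrl_policy(trans_counts, reward_sums, reward_counts, gamma=0.95, iters=200):
    # PSRL, tabular sketch: (1) sample one MDP from the posterior over
    # transitions and rewards, (2) solve it, (3) return the greedy policy
    # to run for the next episode.
    S, A, _ = trans_counts.shape
    # Dirichlet(1) prior plus observed counts -> sampled transition kernel P[s, a, s'].
    P = np.array([[np.random.dirichlet(trans_counts[s, a] + 1.0)
                   for a in range(A)] for s in range(S)])
    # Crude Gaussian posterior around the empirical mean reward (an assumption for brevity).
    mean_r = reward_sums / np.maximum(reward_counts, 1)
    R = np.random.normal(mean_r, 1.0 / np.sqrt(reward_counts + 1.0))
    # Discounted value iteration on the sampled MDP (the paper's setting is episodic).
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[s, a] under the sampled model
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # one greedy action per state

After each episode the agent adds the newly observed transitions and rewards to trans_counts, reward_sums, and reward_counts and re-samples a fresh policy; UCRL-Factored, roughly speaking, replaces the posterior sample with an optimistic model drawn from confidence sets built over the same statistics.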
