

Near-optimal Reinforcement Learning in Factored MDPs

Abstract

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \Omega(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
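To make the PSRL loop mentioned in the abstract concrete, below is a minimal sketch of tabular posterior sampling for a finite-horizon MDP: sample an MDP from the posterior at the start of each episode, plan in the sample by backward induction, act greedily, and update the posterior from observed transitions. This is an illustrative stand-in, not the paper's factored construction; the `env` object with `reset()`/`step()` is a hypothetical interface, and the flat Dirichlet and noisy-mean reward posteriors here are simple placeholders for the per-factor priors the paper actually analyzes.

```python
import numpy as np

def psrl_episode_policy(counts, reward_sums, reward_counts, horizon, rng):
    """Sample one MDP from the posterior and return its optimal
    finite-horizon policy via backward induction."""
    S, A, _ = counts.shape
    # Dirichlet posterior over transitions (uniform prior), per (s, a).
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + 1.0)
    # Posterior mean reward plus Gaussian noise: a simple stand-in for
    # a conjugate reward posterior.
    R = (reward_sums + rng.normal(size=(S, A))) / (reward_counts + 1.0)
    # Backward induction over the horizon.
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P @ V                   # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def run_psrl(env, S, A, horizon, episodes, seed=0):
    """PSRL: each episode, sample an MDP from the posterior, follow its
    optimal policy, then update the posterior with what was observed."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((S, A, S))        # transition counts N(s, a, s')
    reward_sums = np.zeros((S, A))      # cumulative observed rewards
    reward_counts = np.zeros((S, A))    # visit counts N(s, a)
    total_reward = 0.0
    for _ in range(episodes):
        policy = psrl_episode_policy(counts, reward_sums, reward_counts,
                                     horizon, rng)
        s = env.reset()                 # hypothetical environment interface
        for h in range(horizon):
            a = policy[h, s]
            s_next, r = env.step(a)     # hypothetical environment interface
            counts[s, a, s_next] += 1
            reward_sums[s, a] += r
            reward_counts[s, a] += 1
            total_reward += r
            s = s_next
    return total_reward
```

In the factored setting studied in the paper, the single $S \times A \times S$ transition posterior above would be replaced by independent posteriors over each factor's local transition model, so the number of learned parameters, and hence the regret, scales with the factored encoding rather than with $S$ and $A$ directly.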

