Improved Algorithms for Misspecified Linear Markov Decision Processes

Abstract

For the misspecified linear Markov decision process (MLMDP) model of Jin et al. [2020], we propose an algorithm with three desirable properties. (P1) Its regret after $K$ episodes scales as $K \max \{ \varepsilon_{\text{mis}}, \varepsilon_{\text{tol}} \}$, where $\varepsilon_{\text{mis}}$ is the degree of misspecification and $\varepsilon_{\text{tol}}$ is a user-specified error tolerance. (P2) Its space and per-episode time complexities remain bounded as $K \rightarrow \infty$. (P3) It does not require $\varepsilon_{\text{mis}}$ as input. To our knowledge, this is the first algorithm satisfying all three properties. For concrete choices of $\varepsilon_{\text{tol}}$, we also improve existing regret bounds (up to log factors) while achieving either (P2) or (P3) (existing algorithms satisfy neither). At a high level, our algorithm generalizes (to MLMDPs) and refines the Sup-Lin-UCB algorithm, which Takemura et al. [2021] recently showed satisfies (P3) for contextual bandits. We also provide an intuitive interpretation of their result, which informs the design of our algorithm.
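
To make the scaling in (P1) concrete, the following minimal sketch evaluates only the term $K \max \{ \varepsilon_{\text{mis}}, \varepsilon_{\text{tol}} \}$ for a few tolerances. It is an illustration, not the paper's regret bound: the function name and the numeric values are hypothetical, and any additional factors the full bound may carry (for example, dependence on problem parameters or logarithmic terms) are not stated in the abstract and are omitted here.

```python
def misspecification_regret_term(num_episodes, eps_mis, eps_tol):
    """Return K * max(eps_mis, eps_tol), the scaling term named in (P1).

    Hypothetical helper for illustration only; it ignores every factor
    of the actual regret bound that the abstract does not state.
    """
    if num_episodes < 0 or eps_mis < 0 or eps_tol < 0:
        raise ValueError("all arguments must be non-negative")
    return num_episodes * max(eps_mis, eps_tol)


if __name__ == "__main__":
    K = 10_000       # number of episodes (assumed value)
    eps_mis = 0.01   # degree of misspecification (assumed value)
    # Sweep the user-specified tolerance from coarse to fine.
    for eps_tol in (0.1, 0.01, 0.001):
        term = misspecification_regret_term(K, eps_mis, eps_tol)
        print(f"eps_tol={eps_tol}: K * max(eps_mis, eps_tol) = {term}")
```

Once $\varepsilon_{\text{tol}}$ is at most $\varepsilon_{\text{mis}}$, the term equals $K \varepsilon_{\text{mis}}$, so lowering the tolerance further does not change it.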
