arXiv:2107.03635

Sublinear Regret for Learning POMDPs

8 July 2021
Yi Xiong
Ningyuan Chen
Xuefeng Gao
Xiang Zhou
Abstract

We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, measured by the average reward over an infinite horizon. We propose a learning algorithm for this problem that builds on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed algorithm, where $T$ is the learning horizon. To the best of our knowledge, this is the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
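
The abstract only names the algorithm's ingredients at a high level. As a rough illustration of the episodic optimistic-learning structure it describes (estimate the model from data, act according to an optimistic policy, repeat over episodes), here is a minimal Python sketch on a made-up two-state POMDP. This is not the paper's method: a simple count-based estimator over observations stands in for the spectral method-of-moments HMM estimation and the belief error control, and the UCB bonus and episode-length schedule are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state / 2-action / 2-observation POMDP (made up for illustration).
P = np.array([[[0.9, 0.1],      # P[a, s, s']: hidden-state transitions
               [0.2, 0.8]],
              [[0.6, 0.4],
               [0.4, 0.6]]])
Obs = np.array([[0.85, 0.15],   # Obs[s, o]: observation probabilities
                [0.25, 0.75]])
Rew = np.array([[1.0, 0.0],     # Rew[s, a]: mean rewards
                [0.1, 0.8]])

def env_step(s, a):
    """Sample the next hidden state, an observation of it, and a reward."""
    s_next = rng.choice(2, p=P[a, s])
    o = rng.choice(2, p=Obs[s_next])
    return s_next, o, Rew[s, a]

def learn(T):
    """Optimistic episodic loop: re-estimate, re-plan, act, repeat.

    Stand-in simplifications (NOT the paper's construction): the last
    observation is treated as the state, frequency counts replace the
    spectral estimator, and the episode schedule is only meant to show
    how slowly growing episodes keep the number of re-plans sublinear.
    """
    n = np.ones((2, 2))        # visit counts N(o, a); init 1 to avoid /0
    r_sum = np.zeros((2, 2))   # cumulative reward per (o, a)
    s, o, t, total = 0, 0, 0, 0.0
    while t < T:
        # Optimism: empirical mean reward plus a UCB exploration bonus.
        ucb = r_sum / n + np.sqrt(2.0 * np.log(max(t, 2)) / n)
        policy = ucb.argmax(axis=1)        # best action per observation
        for _ in range(int(np.ceil((t + 1) ** (1 / 3)))):  # episode
            if t >= T:
                break
            a = int(policy[o])
            s, o_next, r = env_step(s, a)
            n[o, a] += 1
            r_sum[o, a] += r
            o = o_next
            total += r
            t += 1
    return total / T

print("average reward over T = 20000 steps:", learn(20000))
```

The point of the skeleton is the loop structure: the exploration bonus shrinks as visit counts grow, so the policy commits to high-reward actions at a rate compatible with a sublinear regret bound, which is the role the spectral estimates and belief error control play in the paper's actual analysis.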
