Reward-Mixing MDPs with a Few Latent Contexts are Learnable

5 October 2022
Jeongyeol Kwon
Yonathan Efroni
C. Caramanis
Shie Mannor
Abstract

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and the agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative reward in such a model. Previous work established an upper bound for RMMDPs only for $M=2$. In this work, we resolve several questions that remained open for the RMMDP model. For arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\texttt{EM}^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d\right)$ episodes, where $S$ and $A$ are the numbers of states and actions, respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = \min(2M-1, H)$. Our technique is a higher-order extension of the method-of-moments approach; nevertheless, the design and analysis of $\texttt{EM}^2$ require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ for a general instance of RMMDP, supporting that super-polynomial sample complexity in $M$ is necessary.
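As a reading aid, here is a minimal formalization sketch of the episodic objective described in the abstract; the mixture weights $w_m$ and per-context reward models $r_m$ are our own notation, not taken from the paper.

% Sketch of the RMMDP learning objective (assumed notation, inferred from the abstract):
% a latent context m in {1, ..., M} is drawn with unknown probability w_m at the start
% of each episode; the agent then collects H rewards under that context's reward model
% r_m, while the transition dynamics are shared across contexts.
\[
  V^{\pi} \;=\; \sum_{m=1}^{M} w_m \,
  \mathbb{E}_{\pi}\!\left[ \sum_{h=1}^{H} r_m(s_h, a_h) \right],
  \qquad
  \text{goal: find } \hat{\pi} \text{ such that } \max_{\pi'} V^{\pi'} - V^{\hat{\pi}} \le \epsilon .
\]

Under this reading, the $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d\right)$ bound with $d = \min(2M-1, H)$ is the number of episodes after which such an $\hat{\pi}$ can be returned.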
