arXiv:2207.11126

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

22 July 2022
Orin Levy
Yishay Mansour
Abstract

We present regret minimization algorithms for stochastic contextual MDPs (CMDPs) under a minimum reachability assumption, using access to an offline least-squares regression oracle. We analyze three settings: one where the dynamics are known, one where the dynamics are unknown but independent of the context, and the most challenging one, where the dynamics are unknown and context-dependent. For the last setting, our algorithm obtains a regret bound of $\widetilde{O}\big((H + 1/p_{\min})\, H\, |S|^{3/2} \sqrt{|A|\, T \log(\max\{|\mathcal{G}|, |\mathcal{P}|\}/\delta)}\big)$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as linearity). We present a lower bound of $\Omega\big(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)}\big)$ on the expected regret, which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability that obtains $\widetilde{O}(T^{3/4})$ regret.
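
To get a feel for how the two bounds in the abstract scale, the following is a minimal Python sketch that evaluates only their leading terms. It is not taken from the paper: the function names and the sample parameter values are hypothetical, and the constants and polylogarithmic factors hidden by the $\widetilde{O}$ and $\Omega$ notation are simply dropped, so the returned numbers are growth rates and not actual regret values.

```python
import math


def upper_bound_rate(H, p_min, n_states, n_actions, T, G_size, P_size, delta):
    """Leading term of the high-probability upper bound from the abstract.

    Hypothetical helper: hidden constants and polylog factors are omitted.
    """
    log_term = math.log(max(G_size, P_size) / delta)
    return (H + 1.0 / p_min) * H * n_states ** 1.5 * math.sqrt(n_actions * T * log_term)


def lower_bound_rate(H, n_states, n_actions, T, G_size):
    """Leading term of the expected-regret lower bound from the abstract.

    Requires n_actions >= 2 so that ln(|A|) is positive.
    """
    return math.sqrt(T * H * n_states * n_actions * math.log(G_size) / math.log(n_actions))


if __name__ == "__main__":
    # Illustrative values only; they are assumptions, not taken from the paper.
    params = dict(H=5, n_states=10, n_actions=4, G_size=1000)
    for T in (10_000, 40_000, 160_000):
        up = upper_bound_rate(p_min=0.1, P_size=1000, delta=0.05, T=T, **params)
        low = lower_bound_rate(T=T, **params)
        print(f"T={T:>7}  upper~{up:.3e}  lower~{low:.3e}")
```

Both expressions grow as $\sqrt{T}$, so quadrupling the number of episodes doubles each of them. The gap between them comes from the remaining factors, notably $(H + 1/p_{\min})$ and the higher powers of $H$ and $|S|$ in the upper bound.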
