Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

3 December 2019
Chi Jin
Tiancheng Jin
Haipeng Luo
Suvrit Sra
Tiancheng Yu
Abstract

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret bound as (Rosenberg & Mansour, 2019a), which considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an \textit{upper occupancy bound}.
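For context, the quantities in the abstract can be sketched as follows. The regret compares the learner's cumulative loss over $T$ episodes against the best fixed policy in hindsight:

\[
  \mathrm{Reg}_T = \sum_{t=1}^{T} \ell_t(\pi_t) - \min_{\pi} \sum_{t=1}^{T} \ell_t(\pi).
\]

The "inversely weighted" estimator can be sketched like this (the symbols $u_t$, $\mathcal{P}_t$, $q^{\hat{P},\pi}$, and $\gamma$ are illustrative shorthand, not verbatim from the paper): under bandit feedback, only visited state-action pairs $(x,a)$ reveal their loss, so each observed loss is divided by the upper occupancy bound, the largest probability of visiting $(x,a)$ under any transition function in the current confidence set $\mathcal{P}_t$:

\[
  u_t(x,a) = \max_{\hat{P} \in \mathcal{P}_t} q^{\hat{P},\pi_t}(x,a), \qquad
  \hat{\ell}_t(x,a) = \frac{\ell_t(x,a)\,\mathbb{1}\{(x,a)\ \text{visited in episode } t\}}{u_t(x,a) + \gamma},
\]

where $\gamma > 0$ is a small implicit-exploration term used to obtain high-probability bounds. Since $u_t(x,a)$ upper-bounds the true visit probability of $(x,a)$, the estimator underestimates the loss in expectation, which is the sense in which it is optimistic.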
