Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

7 March 2024
Long-Fei Li
Peng Zhao
Zhi-Hua Zhou
Abstract

We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs, whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mappings, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$ result of Zhao et al. (2023a), since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality tailored to handle non-independent noises, originally proposed in the dynamic assortment area and first applied in reinforcement learning to handle correlations between different states.
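To make the least-squares ingredient concrete, the sketch below shows generic ridge regression for the transition parameter of a linear mixture MDP, where $P(s'|s,a) = \langle \phi(s',s,a), \theta^* \rangle$ and regression features aggregate $\phi$ weighted by a value function. This is a minimal illustration of the standard value-targeted-regression template, not the paper's new estimator; the dimensions, noise level, and synthetic data are made up for the demo.

```python
import numpy as np

# Illustrative sketch: ridge regression for the transition parameter
# theta_star of a linear mixture MDP, where each regression feature x_t
# aggregates phi(s'|s_t, a_t) weighted by a value function and y_t is a
# noisy observation of <x_t, theta_star>. All quantities are synthetic.

rng = np.random.default_rng(0)
d, T = 3, 5000                          # feature dimension, sample count

theta_star = np.array([0.5, 0.3, 0.2])  # true (made-up) mixture weights

# Synthetic regression data standing in for aggregated features/targets.
X = rng.normal(size=(T, d))
y = X @ theta_star + 0.1 * rng.normal(size=T)

lam = 1.0                               # ridge regularizer
A = lam * np.eye(d) + X.T @ X           # regularized Gram matrix (also the
                                        # matrix used in self-normalized bounds)
theta_hat = np.linalg.solve(A, X.T @ y)

print(np.round(theta_hat, 2))
```

The same Gram matrix $A$ that defines the estimator is what a self-normalized concentration inequality controls: the estimation error is measured in the $A$-weighted norm, which is how such bounds tolerate non-independent noise.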
