arXiv:2110.03604

Online Markov Decision Processes with Non-oblivious Strategic Adversary

7 October 2021
Le Cong Dinh, D. Mguni, Long Tran-Thanh, Jun Wang, Yaodong Yang
Abstract

We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external-regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm designed for oblivious adversaries, still applies and achieves a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)} + \tau^2 \sqrt{T \log(|A|)})$, where $L$ is the size of the adversary's pure strategy set and $|A|$ denotes the size of the agent's action space. Motivated by real-world games where the support size of a Nash equilibrium (NE) is small, we further propose a new algorithm, MDP-Online Oracle Expert (MDP-OOE), which achieves a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)} + \tau^2 \sqrt{T k \log(k)})$, where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and can therefore solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods under the same no-external-regret adversary setting in OMDPs, we introduce an algorithm that achieves last-round convergence to a NE. To the best of our knowledge, this is the first work to establish a last-round convergence result in OMDPs.
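The Double Oracle idea that MDP-OOE builds on can be illustrated on a plain two-player zero-sum matrix game. The sketch below is purely illustrative and is not the paper's algorithm: it drops the MDP structure entirely (states, transitions, and the mixing-time terms in the bounds), approximates each restricted game's equilibrium with Hedge self-play (a no-external-regret method, the same class the paper's adversary is assumed to run), and all function names and parameters are hypothetical. Its only purpose is to show why the loop terminates with small restricted strategy sets, which is the property the $k$-dependent bound of MDP-OOE exploits.

```python
import numpy as np

def hedge_equilibrium(G, iters=2000, eta=0.05):
    """Approximate a Nash equilibrium of the zero-sum game with loss
    matrix G (row player minimizes G[i, j], column player maximizes)
    by running Hedge for both players and averaging their strategies."""
    m, n = G.shape
    wr, wc = np.zeros(m), np.zeros(n)            # log-weights per action
    avg_r, avg_c = np.zeros(m), np.zeros(n)
    for _ in range(iters):
        x = np.exp(wr - wr.max()); x /= x.sum()  # row mixed strategy
        y = np.exp(wc - wc.max()); y /= y.sum()  # column mixed strategy
        wr -= eta * (G @ y)    # row player shifts away from high-loss rows
        wc += eta * (G.T @ x)  # column player shifts toward high-gain columns
        avg_r += x; avg_c += y
    return avg_r / iters, avg_c / iters

def double_oracle(G, tol=1e-2):
    """Grow small restricted action sets until neither player's full-game
    best response meaningfully improves on the restricted-game value."""
    R, C = [0], [0]                    # restricted sets of action indices
    while True:
        x, y = hedge_equilibrium(G[np.ix_(R, C)])
        row_losses = G[:, C] @ y       # every row action vs. column mix
        col_gains = G[R, :].T @ x      # every column action vs. row mix
        br_r, br_c = int(row_losses.argmin()), int(col_gains.argmax())
        value = x @ G[np.ix_(R, C)] @ y
        converged = (value - row_losses[br_r] <= tol
                     and col_gains[br_c] - value <= tol)
        if converged or (br_r in R and br_c in C):
            return R, C, x, y          # no useful best response left to add
        if br_r not in R: R.append(br_r)
        if br_c not in C: C.append(br_c)

# A random 200 x 200 zero-sum game: the loop typically terminates with
# restricted sets far smaller than the full action spaces.
rng = np.random.default_rng(0)
G = rng.standard_normal((200, 200))
R, C, x, y = double_oracle(G)
print(f"restricted supports: {len(R)} x {len(C)} of {G.shape[0]} x {G.shape[1]}")
```

In this toy run the restricted sets usually stabilize at a handful of actions out of 200, mirroring the small-NE-support regime the abstract targets: the per-iteration cost depends on the size of the restricted sets rather than on the full action space.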
