Online Learning in Adversarial MDPs: Is the Communicating Case Harder than Ergodic?

Abstract
We study online learning in adversarial communicating Markov Decision Processes (MDPs) with full information. We give an algorithm that achieves a regret of $O(\sqrt{T})$ with respect to the best fixed deterministic policy in hindsight when the transitions are deterministic. We also prove a regret lower bound in this setting which is tight up to polynomial factors in the MDP parameters. Finally, we give an inefficient algorithm that achieves $O(\sqrt{T})$ regret in communicating MDPs (with an additional mild restriction on the transition dynamics).
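For readers unfamiliar with the benchmark, here is a plausible formalization of the regret notion referenced above. It is not spelled out in the abstract; the symbols $r_t$, $s_t$, $a_t$, $s_t^\pi$, and $\Pi$ are assumptions, following the standard adversarial-MDP setup with adversarially chosen reward functions:

$$
\mathrm{Regret}_T \;=\; \max_{\pi \in \Pi} \; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\big(s_t^\pi, \pi(s_t^\pi)\big)\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right],
$$

where $\Pi$ is the set of fixed deterministic policies, $r_t$ is the reward function chosen by the adversary at step $t$, $(s_t, a_t)$ is the learner's state-action trajectory, and $s_t^\pi$ is the state reached by running $\pi$ from the initial state.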