Optimistic Q-learning for average reward and episodic reinforcement learning

Abstract

We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time \textit{for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$, where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.
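
The averaged operator $\overline{L}$ is easy to illustrate numerically. Below is a minimal sketch, not taken from the paper's code: the toy random MDP, the choice of $H$, and all variable names are assumptions made purely for illustration. It applies the average-reward Bellman operator $H$ times, averages the iterates to form $\overline{L} v$, and empirically compares the span semi-norm of $\overline{L} u - \overline{L} v$ against that of $u - v$.

```python
import numpy as np

# Toy illustration (assumed setup, not the paper's experiments): a small random MDP
# with fully supported transitions, the average-reward Bellman operator L, and the
# averaged operator Lbar v = (1/H) * sum_{h=1}^H L^h v.

rng = np.random.default_rng(0)

S, A, H = 5, 3, 10                             # states, actions, horizon parameter (assumed)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))                   # rewards in [0, 1]

def bellman(v):
    """Bellman operator: (L v)(s) = max_a [ r(s, a) + sum_{s'} P(s' | s, a) v(s') ]."""
    return np.max(r + P @ v, axis=1)

def l_bar(v, H=H):
    """Averaged operator: Lbar v = (1/H) * sum_{h=1}^H L^h v."""
    total, cur = np.zeros_like(v), v
    for _ in range(H):
        cur = bellman(cur)
        total += cur
    return total / H

def span(x):
    """Span semi-norm: max(x) - min(x)."""
    return x.max() - x.min()

# Empirical check of span contraction on two arbitrary value vectors.
u, v = rng.normal(size=S), rng.normal(size=S)
print("span(Lbar u - Lbar v) =", span(l_bar(u) - l_bar(v)))
print("span(u - v)           =", span(u - v))
```

On a toy MDP like this one, where every state is reachable in a few steps, the first printed quantity comes out strictly smaller than the second, which is the span-contraction behavior the abstract attributes to $\overline{L}$ under the paper's bounded visiting-time assumption.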

@article{agrawal2025_2407.13743,
  title={Optimistic Q-learning for average reward and episodic reinforcement learning},
  author={Priyank Agrawal and Shipra Agrawal},
  journal={arXiv preprint arXiv:2407.13743},
  year={2025}
}