We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Processes (MDPs). The first one employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. The second approach, rooted in Hessian-based techniques, also ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works assumed without proof.
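For orientation, the following is a minimal sketch of the two quantities referred to above, written in generic notation that is assumed here rather than taken from the paper: $J^{*}$ denotes the optimal average reward, $r(s_t, a_t)$ the reward collected at step $t$, $J(\theta)$ the average reward of the parametrized policy $\pi_{\theta}$, and $L$ the smoothness constant. The smoothness inequality shown is the standard (exact) one; the paper establishes an approximate version of it.

% Expected regret over a horizon of T interactions (notation assumed, not the paper's):
\[
  \mathbb{E}\big[\mathrm{Reg}_T\big] \;=\; \mathbb{E}\Big[\sum_{t=0}^{T-1} \big(J^{*} - r(s_t, a_t)\big)\Big].
\]
% Standard L-smoothness of the average-reward function in the policy parameters:
\[
  \big\|\nabla_{\theta} J(\theta_1) - \nabla_{\theta} J(\theta_2)\big\| \;\le\; L\,\|\theta_1 - \theta_2\| \qquad \text{for all } \theta_1, \theta_2 .
\]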
@article{ganesh2025_2404.02108,
  title={Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs},
  author={Swetha Ganesh and Washim Uddin Mondal and Vaneet Aggarwal},
  journal={arXiv preprint arXiv:2404.02108},
  year={2025}
}