Communication Efficient Parallel Reinforcement Learning

Abstract

We consider the problem where $M$ agents interact with $M$ identical and independent environments, each with $S$ states and $A$ actions, using reinforcement learning for $T$ rounds. The agents share their data with a central server to minimize their regret. We aim to find an algorithm that lets the agents minimize the regret while communicating only infrequently. We provide \NAM\, which runs at each agent, and prove that the total cumulative regret of the $M$ agents is upper bounded by $\tilde{O}(DS\sqrt{MAT})$ for a Markov Decision Process with diameter $D$, number of states $S$, and number of actions $A$. The agents synchronize once the number of their visits to any state-action pair exceeds a certain threshold. Using this rule, we obtain a bound of $O\left(MSA\log(MT)\right)$ on the total number of communication rounds. Finally, we evaluate the algorithm on multiple environments and demonstrate that it performs on par with an always-communicating version of the UCRL2 algorithm while requiring significantly less communication.
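The visit-count synchronization rule described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not state the threshold, so the code uses a hypothetical one, $\max(1, N(s,a))/M$, where $N(s,a)$ is the count known at the last synchronization; the function names `should_sync` and `count_sync_rounds` are invented here; and agents visit uniformly random state-action pairs instead of following a learned policy.

```python
import numpy as np


def should_sync(local_visits, global_counts, num_agents):
    """True if some (s, a) has enough new local visits to trigger a sync.

    ASSUMPTION: the threshold max(1, N(s, a)) / M is a placeholder; the
    abstract only says that "a certain threshold" is used.
    """
    threshold = np.maximum(1, global_counts) / num_agents
    return bool(np.any(local_visits >= threshold))


def count_sync_rounds(num_agents, num_states, num_actions, horizon, seed=0):
    """Count synchronization rounds over `horizon` steps of random visits."""
    rng = np.random.default_rng(seed)
    # Counts known to the server as of the last synchronization.
    global_counts = np.zeros((num_states, num_actions), dtype=np.int64)
    # Per-agent visits accumulated since the last synchronization.
    local = np.zeros((num_agents, num_states, num_actions), dtype=np.int64)
    agents = np.arange(num_agents)
    rounds = 0
    for _ in range(horizon):
        # ASSUMPTION: uniformly random visits stand in for a real policy
        # acting in a real environment.
        states = rng.integers(num_states, size=num_agents)
        actions = rng.integers(num_actions, size=num_agents)
        np.add.at(local, (agents, states, actions), 1)
        if any(should_sync(local[i], global_counts, num_agents) for i in agents):
            # Every agent uploads its local counts; the server merges them.
            global_counts += local.sum(axis=0)
            local[:] = 0
            rounds += 1
    return rounds


if __name__ == "__main__":
    print(count_sync_rounds(num_agents=4, num_states=3, num_actions=2, horizon=2000))
```

Because the assumed threshold grows with the accumulated counts, synchronizations become rarer as the run progresses, which is the intuition behind a communication bound that is logarithmic in $T$. The sketch does not reproduce the paper's algorithm or its regret guarantee.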
