
An operator view of policy gradient methods

Abstract

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
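
As a brief, hedged sketch of the iteration described above (the notation $\Pi_\Theta$ for the set of realizable policies and the divergence $D$ used in the projection are introduced here for illustration only, and need not match the paper's exact choices):

$$\pi_{k+1} = \mathcal{P}\,\mathcal{I}\,\pi_k, \qquad \mathcal{P}\pi' = \operatorname*{arg\,min}_{\pi_\theta \in \Pi_\Theta} D\!\left(\pi',\, \pi_\theta\right),$$

so that $\mathcal{I}$ improves the current policy and $\mathcal{P}$ projects the improved policy back onto the set of policies the parameterization can represent.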
