Distillation Policy Optimization
- OffRL

On-policy algorithms are generally stable but sample-intensive, while off-policy algorithms, which reuse past experience, are sample-efficient but often unstable. Can we design an algorithm that exploits off-policy data while retaining the stable learning of an on-policy approach? In this paper, we present an actor-critic learning framework that blends the two data sources for both evaluation and control, enabling fast learning and applying to a range of on-policy algorithms. At its core, variance-reduction mechanisms such as the unified advantage estimator (UAE) and a learned baseline mitigate both long-term and instantaneous noise, and can also be incorporated into off-policy learning. Empirical results demonstrate significant improvements in sample efficiency, suggesting our method is a promising new learning paradigm.
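The abstract gives no equations, so as a rough illustration of the variance-reduction idea only, here is a minimal GAE-style advantage estimator with a learned value baseline; the UAE described in the paper generalizes this family, and the function name, hyperparameters, and toy numbers below are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

def estimate_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE-style advantage estimate: exponentially weighted sum of TD errors.

    `values` are baseline predictions from a learned value function and must
    contain one extra bootstrap entry (len(values) == len(rewards) + 1).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # One-step TD errors: the baseline subtracts out instantaneous noise.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward accumulation; lam trades bias against long-horizon variance.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy rollout (hypothetical numbers): constant reward, baseline from value estimates.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]  # last entry bootstraps the final state
print(estimate_advantages(rewards, values))
```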