Unsupervised Temporal Learning on Monocular Videos for 3D Human Pose Estimation

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Abstract

In this paper, we propose an unsupervised learning method to extract temporal information from monocular videos: we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant frames as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features, encouraging a gradual transition of these features between nearby and distant frames, and reconstructing the input together extract rich temporal information into the time-variant component, making it well suited for human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques.
