MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

In this paper, we address the problem of 3D human pose understanding in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D pose; such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution for generating a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images and 2D human pose annotations using 3D Motion Capture (MoCap) data. Given a candidate 3D pose, our algorithm selects, for each joint, an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined into a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem; such an approach is viable only with a training set as large as ours. Our method outperforms the state of the art in 3D pose estimation in controlled environments (Human3.6M) and shows promising results on in-the-wild images (LSP).
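To make the two algorithmic ideas in the abstract concrete, here is a minimal sketch of (a) the joint-wise local 2D matching against a pose-annotated image library and (b) clustering 3D poses into K classes for K-way classification. All names (`project_pose`, `local_descriptor`, `match_joint`, `pose_classes`), the camera model, the toy chain skeleton, and the plain K-means routine are assumptions for illustration, not the authors' actual implementation; the patch stitching and blending steps are omitted.

```python
# Hypothetical sketch of joint-wise pose matching and K-way pose clustering.
# Not the paper's code: names, camera model, and skeleton are illustrative.
import numpy as np

def project_pose(pose_3d, focal=1000.0, center=(500.0, 500.0)):
    """Perspective projection of a (J, 3) 3D pose onto the image plane."""
    z = np.clip(pose_3d[:, 2], 1e-6, None)         # guard against zero depth
    x = focal * pose_3d[:, 0] / z + center[0]
    y = focal * pose_3d[:, 1] / z + center[1]
    return np.stack([x, y], axis=1)                # (J, 2) pixel coordinates

def local_descriptor(pose_2d, joint, neighbors):
    """Describe the 2D pose locally around one joint: offsets to its kinematic
    neighbors, normalized for scale so matching tolerates translation/scale."""
    offsets = pose_2d[neighbors] - pose_2d[joint]  # (N, 2) neighbor offsets
    scale = np.linalg.norm(offsets, axis=1).mean() + 1e-6
    return (offsets / scale).ravel()

def match_joint(query_2d, joint, neighbors, library_2d):
    """Return the index of the library image whose local 2D configuration
    around `joint` best matches the projected query pose (nearest neighbor)."""
    q = local_descriptor(query_2d, joint, neighbors)
    dists = [np.linalg.norm(q - local_descriptor(p, joint, neighbors))
             for p in library_2d]
    return int(np.argmin(dists))

def pose_classes(poses_3d, k, iters=50, seed=0):
    """Cluster flattened (N, J, 3) poses into K classes with plain K-means,
    so full-body pose estimation can be cast as K-way classification."""
    rng = np.random.default_rng(seed)
    X = poses_3d.reshape(len(poses_3d), -1)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

# Example: match each joint of a candidate 3D pose against a toy library.
J = 16
rng = np.random.default_rng(1)
library_2d = rng.uniform(0, 1000, size=(200, J, 2))   # 200 annotated 2D poses
candidate_3d = rng.normal(size=(J, 3))
candidate_3d[:, 2] += 5.0                              # keep depths positive
query_2d = project_pose(candidate_3d)
skeleton = {j: [max(j - 1, 0), min(j + 1, J - 1)] for j in range(J)}  # toy chain
picks = [match_joint(query_2d, j, skeleton[j], library_2d) for j in range(J)]
```

In this reading, `picks` names one source image per joint; the paper's engine would then composite patches from those images under kinematic constraints, and `pose_classes` would supply the labels for the K-way classification CNN.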