EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

Zhenyu Li
Sai Kumar Dwivedi
Filip Maric
Carlos Chacon
Nadine Bertsch
Filippo Arcadu
Tomas Hodan
Michael Ramamonjisoa
Peter Wonka
Amy Zhao
Robin Kips
Cem Keskin
Anastasia Tkach
Chenhongyi Yang
Main: 8 pages · Bibliography: 3 pages · Appendix: 3 pages · 9 figures · 8 tables
Abstract

Egocentric human motion estimation is essential for AR/VR experiences, yet it remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable; it introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and it supports both keypoint and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training: a teacher-student scheme generates pseudo-labels and guides training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with a latency of 0.8 ms on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy and reduces temporal jitter by 22.2% and 51.7%, respectively. The auto-labeling system further improves wrist MPJPE by 13.1%.
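
The abstract names the model's main architectural ingredients: identity-conditioned queries, multi-view spatial refinement, and causal temporal attention. The paper itself is not reproduced here, so the sketch below only illustrates how such pieces could fit together in a query-based decoder; the class name, module choices, and tensor shapes (`CausalPoseDecoder`, `identity_proj`, 23 joints, 256-d features) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed shapes and names) of a decoder with identity-conditioned
# joint queries, multi-view cross-attention, and causal temporal self-attention.
import torch
import torch.nn as nn


class CausalPoseDecoder(nn.Module):
    def __init__(self, num_joints=23, dim=256, num_heads=8):
        super().__init__()
        # One learned query per body joint, conditioned on an identity embedding
        # (e.g. body-shape or calibration features) through a linear projection.
        self.joint_queries = nn.Parameter(torch.randn(num_joints, dim))
        self.identity_proj = nn.Linear(dim, dim)
        # Cross-attention over multi-view image tokens for spatial refinement.
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over time with a causal mask, so frame t only attends
        # to frames <= t and per-frame compute stays constant at inference.
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, 3)  # 3D keypoint per joint

    def forward(self, view_feats, identity_emb):
        # view_feats: (B, T, N, D) flattened multi-view tokens; identity_emb: (B, D)
        B, T, N, D = view_feats.shape
        J = self.joint_queries.shape[0]
        # Identity-conditioned queries, shared across all frames of the sequence.
        q = self.joint_queries + self.identity_proj(identity_emb)[:, None, :]   # (B, J, D)
        q = q[:, None].expand(B, T, J, D).reshape(B * T, J, D)
        kv = view_feats.reshape(B * T, N, D)
        q, _ = self.view_attn(q, kv, kv)                       # spatial refinement
        # Causal temporal attention: time becomes the sequence axis per joint.
        q = q.reshape(B, T, J, D).permute(0, 2, 1, 3).reshape(B * J, T, D)
        causal_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        q, _ = self.time_attn(q, q, q, attn_mask=causal_mask)
        q = q.reshape(B, J, T, D).permute(0, 2, 1, 3)          # (B, T, J, D)
        return self.head(q)                                    # (B, T, J, 3)
```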
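The uncertainty-aware teacher-student training on unlabeled frames could be realized roughly as below. The per-joint log-variance output, the loss weighting, and the function names are assumptions made for illustration; the abstract only states that a teacher generates pseudo-labels and that training is guided by uncertainty distillation.

```python
# Hypothetical sketch of one semi-supervised step with teacher pseudo-labels
# and uncertainty distillation (names and weights are illustrative only).
import torch
import torch.nn.functional as F


def semi_supervised_step(student, teacher, unlabeled_batch, optimizer):
    with torch.no_grad():
        # Frozen teacher predicts 3D joints plus a per-joint log-variance
        # that serves as an uncertainty estimate for each pseudo-label.
        pseudo_joints, teacher_logvar = teacher(unlabeled_batch)

    pred_joints, student_logvar = student(unlabeled_batch)

    # Uncertainty-weighted pose loss: confident pseudo-labels contribute more.
    weight = torch.exp(-teacher_logvar)
    pose_loss = (weight * (pred_joints - pseudo_joints).abs()).mean()

    # Uncertainty distillation: the student also regresses the teacher's
    # confidence, so it can flag unreliable estimates at test time.
    unc_loss = F.mse_loss(student_logvar, teacher_logvar)

    loss = pose_loss + 0.1 * unc_loss  # 0.1 is an assumed weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```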
