
EgoM2P: Egocentric Multimodal Multitask Pretraining

Main: 8 pages · Appendix: 4 pages · Bibliography: 6 pages · 8 figures · 10 tables
Abstract

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models.
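The abstract stresses that egocentric datasets differ in which modalities they actually provide, so any large-scale pretraining pipeline must tolerate missing labels rather than assume full supervision. As a rough illustration only (this is not the paper's code; `EgoSample` and its fields are hypothetical names), one way to represent a clip with optional modalities and derive a presence mask that a multitask loss could use to skip unavailable targets:

```python
# Illustrative sketch: NOT the EgoM2P implementation, just one way to represent
# egocentric samples whose modality coverage varies across capture devices.
from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np


@dataclass
class EgoSample:
    """One egocentric clip; any modality beyond RGB may be missing."""
    rgb: np.ndarray                            # (T, H, W, 3) video frames
    depth: Optional[np.ndarray] = None         # (T, H, W) depth maps, if captured
    camera_poses: Optional[np.ndarray] = None  # (T, 4, 4) head-mounted camera trajectory
    gaze: Optional[np.ndarray] = None          # (T, 2) normalized gaze points

    def modality_mask(self) -> Dict[str, bool]:
        """Presence flags so a multitask loss can skip modalities with no ground truth."""
        return {
            "rgb": True,
            "depth": self.depth is not None,
            "camera_poses": self.camera_poses is not None,
            "gaze": self.gaze is not None,
        }


if __name__ == "__main__":
    # A clip from a headset that records RGB and camera poses but not depth or gaze.
    clip = EgoSample(
        rgb=np.zeros((16, 224, 224, 3), dtype=np.uint8),
        camera_poses=np.tile(np.eye(4), (16, 1, 1)),
    )
    print(clip.modality_mask())
    # {'rgb': True, 'depth': False, 'camera_poses': True, 'gaze': False}
```

In such a setup, loss terms for absent modalities would simply be masked out, which avoids the need to pseudo-label signals like gaze or camera trajectories that the abstract notes are often infeasible to generate.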

@article{li2025_2506.07886,
  title={EgoM2P: Egocentric Multimodal Multitask Pretraining},
  author={Gen Li and Yutong Chen and Yiqian Wu and Kaifeng Zhao and Marc Pollefeys and Siyu Tang},
  journal={arXiv preprint arXiv:2506.07886},
  year={2025}
}