EgoM2P: Egocentric Multimodal Multitask Pretraining

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models.
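To make the heterogeneity problem concrete, below is a minimal sketch (not the authors' implementation) of how a multimodal pretraining loss can tolerate samples with different modality coverage: the loss for a modality is simply skipped for clips that never recorded it, so no pseudo-labels for gaze or camera trajectories are required. All class and variable names are hypothetical.

import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "gaze", "camera_pose"]

class MultimodalMaskedLoss(nn.Module):
    """Sums per-modality losses, skipping modalities absent from a sample."""

    def __init__(self):
        super().__init__()
        self.criterion = nn.MSELoss(reduction="none")

    def forward(self, predictions, targets, availability):
        # predictions / targets: dict modality -> (B, T, D) tensors
        # availability: dict modality -> (B,) bool mask, True only if the
        # device actually captured that modality for the sample.
        total, count = 0.0, 0
        for m in MODALITIES:
            if m not in targets:
                continue
            per_elem = self.criterion(predictions[m], targets[m])  # (B, T, D)
            per_sample = per_elem.mean(dim=(1, 2))                 # (B,)
            mask = availability[m].float()
            denom = mask.sum().clamp(min=1.0)
            total = total + (per_sample * mask).sum() / denom
            count += 1
        return total / max(count, 1)

In this sketch, a batch can freely mix clips from gaze-tracking headsets and plain RGB cameras; gradients only flow through modalities that were genuinely observed.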
@article{li2025_2506.07886,
  title   = {EgoM2P: Egocentric Multimodal Multitask Pretraining},
  author  = {Gen Li and Yutong Chen and Yiqian Wu and Kaifeng Zhao and Marc Pollefeys and Siyu Tang},
  journal = {arXiv preprint arXiv:2506.07886},
  year    = {2025}
}