EgoM2P: Egocentric Multimodal Multitask Pretraining

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models.
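To make the heterogeneity problem concrete, below is a minimal sketch (not the authors' implementation) of how a multimodal pretraining loss can tolerate samples with different modality coverage: the loss for a modality is simply skipped for clips that never recorded it, so no pseudo-labels for gaze or camera trajectories are required. All class and variable names are hypothetical.

import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "gaze", "camera_pose"]

class MultimodalMaskedLoss(nn.Module):
    """Sums per-modality losses, skipping modalities absent from a sample."""

    def __init__(self):
        super().__init__()
        self.criterion = nn.MSELoss(reduction="none")

    def forward(self, predictions, targets, availability):
        # predictions / targets: dict modality -> (B, T, D) tensors
        # availability: dict modality -> (B,) bool mask, True only if the
        # device actually captured that modality for the sample.
        total, count = 0.0, 0
        for m in MODALITIES:
            if m not in targets:
                continue
            per_elem = self.criterion(predictions[m], targets[m])  # (B, T, D)
            per_sample = per_elem.mean(dim=(1, 2))                 # (B,)
            mask = availability[m].float()
            denom = mask.sum().clamp(min=1.0)
            total = total + (per_sample * mask).sum() / denom
            count += 1
        return total / max(count, 1)

In this sketch, a batch can freely mix clips from gaze-tracking headsets and plain RGB cameras; gradients only flow through modalities that were genuinely observed.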
@article{li2025_2506.07886,
  title   = {EgoM2P: Egocentric Multimodal Multitask Pretraining},
  author  = {Gen Li and Yutong Chen and Yiqian Wu and Kaifeng Zhao and Marc Pollefeys and Siyu Tang},
  journal = {arXiv preprint arXiv:2506.07886},
  year    = {2025}
}