Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

15 April 2025
Elisa Ancarani
Julie Tores
Lucile Sassatelli
Rémy Sun
Hui-Yin Wu
Frédéric Precioso
Abstract

We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of the annotated concepts. Models trained on CMSDs outperform models trained on the legacy, non-modality-specific data, in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models, and contribute to advancing interpretable multimodal learning in complex video analysis.
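
To make the CMSD construction concrete, below is a minimal sketch of the partitioning step, assuming each human annotation records which modality (visual, textual, or audio) its explanatory concept was attributed to. The Annotation record, the build_cmsds helper, and the placeholder concept labels are hypothetical illustrations for this page, not the authors' released code; early and late fusion models would then be trained on the resulting per-modality subsets rather than on the undivided legacy data.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Annotation:
    clip_id: str   # identifier of the annotated video segment
    concept: str   # explanatory concept label given by the annotator
    modality: str  # modality the concept was attributed to: "visual", "textual", or "audio"

def build_cmsds(annotations):
    """Group annotations into per-modality subsets (the CMSD idea):
    one training subset per modality of the annotated concept."""
    cmsds = defaultdict(list)
    for ann in annotations:
        if ann.modality not in ("visual", "textual", "audio"):
            raise ValueError(f"unexpected modality: {ann.modality}")
        cmsds[ann.modality].append(ann)
    return dict(cmsds)

# Toy usage with placeholder labels (not real MOByGaze annotations):
anns = [
    Annotation("clip_001", "concept_a", "visual"),
    Annotation("clip_002", "concept_b", "textual"),
    Annotation("clip_003", "concept_c", "audio"),
]
for modality, subset in build_cmsds(anns).items():
    print(modality, [a.clip_id for a in subset])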

View on arXiv: https://arxiv.org/abs/2504.11232
@article{ancarani2025_2504.11232,
  title={Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset},
  author={Elisa Ancarani and Julie Tores and Lucile Sassatelli and Rémy Sun and Hui-Yin Wu and Frédéric Precioso},
  journal={arXiv preprint arXiv:2504.11232},
  year={2025}
}