Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

15 April 2025
Elisa Ancarani
Julie Tores
Lucile Sassatelli
Rémy Sun
Hui-Yin Wu
Frédéric Precioso
Abstract

We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of the annotated concepts. Models trained on CMSDs outperform models trained on the legacy, non-modality-specific data, in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models, and contribute to advancing interpretable multimodal learning in complex video analysis.
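
To make the CMSD construction concrete, below is a minimal sketch of the partitioning step, assuming each human annotation records which modality (visual, textual, or audio) its explanatory concept was attributed to. The Annotation record, the build_cmsds helper, and the placeholder concept labels are hypothetical illustrations for this page, not the authors' released code; early and late fusion models would then be trained on the resulting per-modality subsets rather than on the undivided legacy data.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Annotation:
    clip_id: str   # identifier of the annotated video segment
    concept: str   # explanatory concept label given by the annotator
    modality: str  # modality the concept was attributed to: "visual", "textual", or "audio"

def build_cmsds(annotations):
    """Group annotations into per-modality subsets (the CMSD idea):
    one training subset per modality of the annotated concept."""
    cmsds = defaultdict(list)
    for ann in annotations:
        if ann.modality not in ("visual", "textual", "audio"):
            raise ValueError(f"unexpected modality: {ann.modality}")
        cmsds[ann.modality].append(ann)
    return dict(cmsds)

# Toy usage with placeholder labels (not real MOByGaze annotations):
anns = [
    Annotation("clip_001", "concept_a", "visual"),
    Annotation("clip_002", "concept_b", "textual"),
    Annotation("clip_003", "concept_c", "audio"),
]
for modality, subset in build_cmsds(anns).items():
    print(modality, [a.clip_id for a in subset])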

View on arXiv: https://arxiv.org/abs/2504.11232
@article{ancarani2025_2504.11232,
  title={Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset},
  author={Elisa Ancarani and Julie Tores and Lucile Sassatelli and Rémy Sun and Hui-Yin Wu and Frédéric Precioso},
  journal={arXiv preprint arXiv:2504.11232},
  year={2025}
}