Cross-modal Causal Relation Alignment for Video Question Grounding

5 March 2025
Weixing Chen
Yang Liu
Binglin Chen
Jiandong Su
Yongsen Zheng
Liang Lin
Abstract

Video question grounding (VideoQG) requires models to answer questions while simultaneously inferring the relevant video segments that support those answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations and thus fail to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA) to eliminate spurious correlations and improve the causal consistency between question answering and video temporal grounding. CRA comprises three essential components: i) a Gaussian Smoothing Grounding (GSG) module that estimates the time interval via cross-modal attention, denoised by an adaptive Gaussian filter; ii) a Cross-Modal Alignment (CMA) module that enhances weakly supervised VideoQG by applying bidirectional contrastive learning between estimated video segments and QA features; and iii) an Explicit Causal Intervention (ECI) module for multimodal deconfounding, which applies front-door intervention on vision and back-door intervention on language. Extensive experiments on two VideoQG datasets demonstrate the superiority of CRA in discovering visually grounded content and achieving robust question reasoning. Code is available at this https URL.
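The abstract's GSG step — denoising per-frame cross-modal attention with a Gaussian filter before reading off a time interval — can be illustrated with a minimal sketch. This is not the paper's implementation: the input `attn` (per-frame attention scores), the fixed `sigma`, and the 0.5 threshold are all assumptions for illustration; the paper's filter is adaptive.

```python
import numpy as np

def gaussian_smooth_grounding(attn, sigma=2.0, threshold=0.5):
    """Estimate a grounded time interval from noisy per-frame
    cross-modal attention scores (hypothetical sketch of GSG)."""
    # Build a normalized Gaussian kernel with radius ~3*sigma.
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    # Denoise the attention scores along the time axis.
    smoothed = np.convolve(attn, kernel, mode="same")
    # Rescale to [0, 1] and keep the span of frames above threshold.
    smoothed = (smoothed - smoothed.min()) / (np.ptp(smoothed) + 1e-8)
    frames = np.where(smoothed >= threshold)[0]
    if frames.size == 0:
        return None
    return int(frames[0]), int(frames[-1])
```

Smoothing before thresholding prevents a single spuriously high attention score from producing an isolated, implausibly short segment; the returned pair of frame indices stands in for the estimated start/end of the grounded interval.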

@article{chen2025_2503.07635,
  title={Cross-modal Causal Relation Alignment for Video Question Grounding},
  author={Weixing Chen and Yang Liu and Binglin Chen and Jiandong Su and Yongsen Zheng and Liang Lin},
  journal={arXiv preprint arXiv:2503.07635},
  year={2025}
}