
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

Abstract

Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce COVER (COunterfactual VidEo Reasoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments.
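The sub-question evaluation described in the abstract can be pictured with a small sketch. The Python below is illustrative only and is not the authors' released code: the names CoverItem, accuracies, and sub_main_correlation are hypothetical, and exact-match scoring is an assumed simplification. It shows how a counterfactual main question paired with structured sub-questions could be scored per model, and how the correlation between sub-question accuracy and main-question accuracy, the relationship the paper reports as strong, could be computed across models.

# Illustrative sketch only -- not the authors' released code. Names and
# exact-match scoring are assumptions made for this example.
from dataclasses import dataclass
from statistics import correlation  # Pearson correlation (Python 3.10+)

@dataclass
class CoverItem:
    """One benchmark item: a counterfactual main question plus its sub-questions."""
    video_id: str
    main_question: str        # e.g. a "what if ...?" query about the video
    main_answer: str
    sub_questions: list[str]  # the structured steps the main query decomposes into
    sub_answers: list[str]

def accuracies(items, predictions):
    """Return (sub-question accuracy, main-question accuracy) for one model.

    predictions maps video_id -> (main_pred, [sub_preds]).
    """
    sub_hits = sub_total = main_hits = 0
    for item in items:
        main_pred, sub_preds = predictions[item.video_id]
        main_hits += int(main_pred == item.main_answer)
        for pred, gold in zip(sub_preds, item.sub_answers):
            sub_hits += int(pred == gold)
            sub_total += 1
    return sub_hits / sub_total, main_hits / len(items)

def sub_main_correlation(items, per_model_predictions):
    """Pearson correlation between sub-question accuracy and main-question
    accuracy across models."""
    pairs = [accuracies(items, preds) for preds in per_model_predictions.values()]
    sub_accs, main_accs = zip(*pairs)
    return correlation(sub_accs, main_accs)

With such a structure, a low sub-question accuracy localizes which perception or cognition step fails before the counterfactual query itself is reached, which is the kind of fine-grained analysis the benchmark is built to support.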

View on arXiv: https://arxiv.org/abs/2503.10691
@article{zhou2025_2503.10691,
  title={Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation},
  author={Qiji Zhou and Yifan Gong and Guangsheng Bao and Hongjie Qiu and Jinqiang Li and Xiangrong Zhu and Huajian Zhang and Yue Zhang},
  journal={arXiv preprint arXiv:2503.10691},
  year={2025}
}