VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

5 June 2025
Hanoona Rasheed
Abdelrahman Shaker
Anqi Tang
Muhammad Maaz
Ming-Hsuan Yang
Salman Khan
Fahad Shahbaz Khan
Main: 14 pages · 6 figures · 3 tables · Bibliography: 3 pages
Abstract

Mathematical reasoning in real-world video settings presents a fundamentally different challenge from that in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues that are often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: this https URL
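
The abstract describes questions that carry multi-step reasoning annotations across three reasoning types, but the concrete data format of the release is not given here. The Python sketch below shows how one benchmark item and a step-level diagnostic might be represented; all field names, the reasoning-type labels, and the step_accuracy helper are illustrative assumptions, not the actual VideoMathQA schema or evaluation code.

# Minimal illustrative sketch of one benchmark item and a fine-grained
# diagnostic. Field names and labels are assumptions for illustration only;
# they are not the released VideoMathQA schema or evaluation code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoMathQAItem:
    video_id: str                  # source video, 10 seconds to over 1 hour long
    domain: str                    # one of the 10 mathematical domains
    reasoning_type: str            # assumed labels: "direct", "transfer", "comprehension"
    question: str                  # question text
    choices: List[str]             # answer options
    answer: str                    # correct option
    reasoning_steps: List[str] = field(default_factory=list)  # multi-step annotations

def step_accuracy(predicted_steps: List[str], annotated_steps: List[str]) -> float:
    """Toy diagnostic: fraction of annotated reasoning steps the model
    reproduced, illustrating the kind of fine-grained, step-level analysis
    the abstract describes (exact string match used only for simplicity)."""
    if not annotated_steps:
        return 0.0
    return sum(step in predicted_steps for step in annotated_steps) / len(annotated_steps)

A real evaluation would likely score steps with softer matching than exact string equality, but the structure above is enough to show how per-step annotations enable diagnosis beyond final-answer accuracy.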

@article{rasheed2025_2506.05349,
  title={VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos},
  author={Hanoona Rasheed and Abdelrahman Shaker and Anqi Tang and Muhammad Maaz and Ming-Hsuan Yang and Salman Khan and Fahad Shahbaz Khan},
  journal={arXiv preprint arXiv:2506.05349},
  year={2025}
}