Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

10 December 2025

Xinkui Zhao

Zuxin Wang

Yifan Zhang

Guanjie Cheng

Yueshen Xu

Shuiguang Deng

Chang Liu

Naibo Wang

Jianwei Yin

LRM


Main: 8 Pages

11 Figures

Bibliography: 2 Pages

5 Tables

Appendix: 7 Pages

Abstract

The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of vision-language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally expensive. Dense frame encoding generates an excessive number of visual tokens, which leads to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency exposes a key limitation of the traditional process-then-reason paradigm, which analyzes the visual stream exhaustively before any semantic reasoning takes place. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that reframes video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources according to the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results indicate that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
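The idea of a feedback loop between reasoning and perception can be illustrated with a toy sketch. This is not the Video-QTR algorithm, whose details are not given in the abstract; it is a minimal illustration that assumes a hypothetical `relevance` callback which scores how well a single frame matches the query. The function name `query_driven_select` and the parameters `coarse_stride`, `budget`, and `threshold` are likewise assumptions made for the example. The loop samples the video sparsely, then inspects additional frames only around the most query-relevant frame found so far, and stops as soon as the evidence is strong enough.

```python
from typing import Callable, Dict, List


def query_driven_select(
    num_frames: int,
    relevance: Callable[[int], float],  # hypothetical query-to-frame scorer
    coarse_stride: int = 100,
    budget: int = 40,
    threshold: float = 0.9,
) -> List[int]:
    """Return the sorted indices of the frames that were actually inspected."""
    scores: Dict[int, float] = {}

    def inspect(i: int) -> None:
        # Score a frame once, only if it exists and the budget allows it.
        if 0 <= i < num_frames and i not in scores and len(scores) < budget:
            scores[i] = relevance(i)

    # First perception pass: sparse uniform sampling, not dense encoding.
    for i in range(0, num_frames, coarse_stride):
        inspect(i)

    stride = coarse_stride
    # Feedback loop: the scores gathered so far decide where to look next.
    while scores and stride > 1 and len(scores) < budget:
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            break  # enough evidence for this query, stop perceiving
        stride //= 2
        inspect(best - stride)
        inspect(best + stride)

    return sorted(scores)
```

With a synthetic 1,000-frame video whose relevance peaks near frame 730, for example `lambda i: max(0.0, 1.0 - abs(i - 730) / 200)` and `threshold=0.95`, this toy loop inspects only a small fraction of the frames before stopping. The figure is a property of the toy setup and says nothing about the reductions reported in the paper.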
