  3. 2501.12386

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

21 January 2025
Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang
Main: 11 pages, 6 figures, 5 tables; bibliography: 6 pages
Abstract

This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. To this end, we develop a new version of InternVideo2.5, focused on enhancing the original MLLM's ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and builds compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this LRC design substantially improves video MLLM performance on mainstream video understanding benchmarks (short and long), enables the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and lets it master specialized vision capabilities such as object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering an MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at this https URL.
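Neither mechanism is detailed on this page. For reference, direct preference optimization (DPO) is a standard alignment technique whose usual objective is

  L_DPO(θ) = −E_{(x, y_w, y_l)}[ log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)) ) ],

though the paper's variant for dense vision task annotations may differ. The adaptive hierarchical token compression is likewise unspecified here; below is a minimal, purely illustrative PyTorch sketch of the general idea (merging adjacent spatiotemporal tokens level by level). All names are hypothetical, and this is not the paper's actual implementation, which is adaptive rather than fixed.

import torch

def hierarchical_compress(video_tokens: torch.Tensor, levels: int = 2) -> torch.Tensor:
    # Toy stand-in for hierarchical token compression: each level merges
    # adjacent token pairs by averaging, shrinking the sequence 2x.
    # NOT the paper's method; names and scheme are illustrative only.
    # video_tokens: (num_tokens, hidden_dim)
    x = video_tokens
    for _ in range(levels):
        if x.shape[0] < 2:
            break
        n = x.shape[0] - (x.shape[0] % 2)             # drop a trailing odd token
        x = x[:n].reshape(n // 2, 2, -1).mean(dim=1)  # average adjacent pairs
    return x

# Example: 1024 visual tokens shrink to 256 after two 2x levels.
tokens = torch.randn(1024, 768)
print(hierarchical_compress(tokens, levels=2).shape)  # torch.Size([256, 768])

A fixed 2x-per-level merge like this trades fine detail for context length uniformly; an adaptive scheme would instead decide per region or per frame how aggressively to merge.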
