Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

6 October 2025
Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
Topics: MLLM, OffRL, VLM, LRM
Links: arXiv 2510.05034 (abs) · PDF · HTML · HuggingFace (43 upvotes) · GitHub (142★)
Main: 29 pages · 3 figures · 4 tables · Bibliography: 6 pages · Appendix: 1 page
Abstract

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols, while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: this https URL
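
Of the three pillars, "RL from verifiable objectives" is the most concrete to illustrate: the reward is computed by a rule checker over the model's output rather than by a learned reward model. Below is a minimal sketch of such a reward for temporal video grounding, one of the video-specific challenges the abstract names. The <think>/<answer> tag format, the temporal-IoU accuracy term, and the small format bonus are illustrative assumptions drawn from common RLVR practice, not a method specified by this survey's abstract.

```python
import re

def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """IoU of two [start, end] intervals in seconds (0.0 if disjoint)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def parse_span(completion: str) -> tuple[float, float] | None:
    """Extract a '<answer>start-end</answer>' span from a model completion."""
    m = re.search(r"<answer>\s*([\d.]+)\s*-\s*([\d.]+)\s*</answer>", completion)
    return (float(m.group(1)), float(m.group(2))) if m else None

def verifiable_reward(completion: str, gold: tuple[float, float]) -> float:
    """Rule-checkable scalar reward: small format bonus + graded accuracy.

    Hypothetical decomposition for illustration only; real systems tune
    the weighting and often add length or language penalties.
    """
    fmt = 0.1 if ("<think>" in completion and "<answer>" in completion) else 0.0
    span = parse_span(completion)
    acc = temporal_iou(span, gold) if span is not None else 0.0
    return fmt + acc

# Example: grade a completion that grounds an event at 12.0-18.5s
# against a gold span of 10.0-18.0s (reward ≈ 0.1 + 0.71).
r = verifiable_reward("<think>...</think><answer>12.0-18.5</answer>", (10.0, 18.0))
```

Because the reward is a deterministic scalar, it plugs directly into group-based policy-gradient methods such as GRPO, where several completions are sampled per prompt and advantages come from group-normalized rewards; test-time scaling reuses the same checkability idea at inference, for example by sampling multiple reasoning traces and voting.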
