SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

2 April 2025
Kun Ouyang
Yuanxin Liu
Haoning Wu
Yi Liu
Hao Zhou
Jie Zhou
Fandong Meng
Xu Sun
Abstract

Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the SpaceR framework. First, we present SpaceR-151k, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, plus 60k samples for maintaining general multimodal understanding. Second, we propose Spatially-Guided RLVR (SG-RLVR), a reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts during the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing the spatial reasoning ability of MLLMs. Code, model, and dataset are available at this https URL.
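The abstract describes SG-RLVR as an extension of GRPO with verifiable rewards. As a rough illustration of that machinery, the minimal sketch below shows the standard GRPO group-relative advantage (rewards standardized within a group of sampled responses, with no learned value function) combined with a rule-based verifiable reward. Everything here is an assumption, not the paper's implementation: the names (Rollout, verifiable_reward, used_map, map_bonus) are hypothetical, and the map-imagination bonus only gestures at the mechanism the abstract names.

from dataclasses import dataclass
import math

@dataclass
class Rollout:
    answer: str     # final answer parsed from the model's response
    used_map: bool  # hypothetical flag: did the reasoning trace sketch a spatial layout?

def verifiable_reward(rollout: Rollout, gold: str, map_bonus: float = 0.1) -> float:
    """Rule-based reward: 1.0 for an exact-match answer, plus a small
    hypothetical bonus when the trace includes a spatial-layout sketch."""
    r = 1.0 if rollout.answer.strip() == gold.strip() else 0.0
    if rollout.used_map:
        r += map_bonus
    return r

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward against its own group,
    removing the need for a learned value function."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: score a group of G = 3 sampled responses to one question.
group = [Rollout("3 meters", True), Rollout("5 meters", False), Rollout("3 meters", False)]
rewards = [verifiable_reward(ro, gold="3 meters") for ro in group]
print(group_relative_advantages(rewards))  # highest advantage: correct answer + map use

In a full training loop these advantages would weight a clipped policy-gradient objective against a reference policy, as in GRPO; that part is omitted here.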

@article{ouyang2025_2504.01805,
  title={SpaceR: Reinforcing MLLMs in Video Spatial Reasoning},
  author={Kun Ouyang and Yuanxin Liu and Haoning Wu and Yi Liu and Hao Zhou and Jie Zhou and Fandong Meng and Xu Sun},
  journal={arXiv preprint arXiv:2504.01805},
  year={2025}
}