Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

22 May 2025
Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
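To make the multi-frame setup concrete, below is a minimal sketch of what a multi-frame spatial question-answering sample and a uniform exact-match metric could look like. The class, field, and task names are illustrative assumptions only; they are not the actual MultiSPA schema or the paper's evaluation code.

from dataclasses import dataclass
from typing import List

# Hypothetical sample format: several frames of the same scene plus a
# spatial question whose answer requires reasoning across frames
# (e.g. depth, visual correspondence, camera or object motion).
@dataclass
class MultiFrameSample:
    frame_paths: List[str]   # two or more frames of the same scene
    task: str                # e.g. "depth", "correspondence", "camera_motion"
    question: str            # natural-language spatial question
    answer: str              # ground-truth answer

def exact_match_accuracy(predictions: List[str], samples: List[MultiFrameSample]) -> float:
    """One uniform metric across tasks: fraction of predictions matching ground truth."""
    correct = sum(p.strip().lower() == s.answer.strip().lower()
                  for p, s in zip(predictions, samples))
    return correct / max(len(samples), 1)

if __name__ == "__main__":
    samples = [
        MultiFrameSample(
            frame_paths=["frame_0.png", "frame_1.png"],
            task="camera_motion",
            question="Did the camera move left or right between the two frames?",
            answer="right",
        ),
    ]
    print(exact_match_accuracy(["Right"], samples))  # 1.0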

@article{xu2025_2505.17015,
  title={Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models},
  author={Runsen Xu and Weiyao Wang and Hao Tang and Xingyu Chen and Xiaodong Wang and Fu-Jen Chu and Dahua Lin and Matt Feiszli and Kevin J. Liang},
  journal={arXiv preprint arXiv:2505.17015},
  year={2025}
}