
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

31 March 2025
Y. Li
Y. Zhang
Tao Lin
Xiangrui Liu
Wenxiao Cai
Zheng Liu
Bo Zhao
Abstract

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To evaluate models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

@article{li2025_2503.23765,
  title={STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?},
  author={Yun Li and Yiming Zhang and Tao Lin and XiangRui Liu and Wenxiao Cai and Zheng Liu and Bo Zhao},
  journal={arXiv preprint arXiv:2503.23765},
  year={2025}
}