Can Multimodal Large Language Models Understand Spatial Relations?

25 May 2025
Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, Tong Ruan
Main: 8 pages · 7 figures · 14 tables · Bibliography: 3 pages · Appendix: 2 pages
Abstract

Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which requires MLLMs to focus on understanding images of the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in 5,392 samples. On this benchmark, we evaluate a series of closed- and open-source MLLMs; the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. We also conduct extensive experimental analyses and suggest future research directions. The benchmark and code are available at this https URL.
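The abstract reports accuracy over 5,392 human-annotated samples. Below is a minimal sketch of how such an accuracy evaluation loop might look; the JSONL layout, the field names (image, question, options, answer), the file name, and the ask_model() stub are assumptions for illustration, not the authors' released data format or code.

```python
# Minimal evaluation sketch for a multiple-choice spatial-relation benchmark.
# All field names, the file name, and ask_model() are hypothetical placeholders;
# they do not reflect the SpatialMQA release.
import json


def ask_model(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an MLLM call that returns one of the given options."""
    raise NotImplementedError("plug in your MLLM client here")


def evaluate(benchmark_path: str) -> float:
    correct, total = 0, 0
    with open(benchmark_path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)  # one JSON object per line (assumed)
            prediction = ask_model(
                sample["image"],       # COCO2017 image path (assumed field)
                sample["question"],
                sample["options"],
            )
            correct += int(prediction.strip().lower() == sample["answer"].strip().lower())
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    accuracy = evaluate("spatialmqa_test.jsonl")  # hypothetical file name
    print(f"Accuracy: {accuracy:.2%}")
```

Exact string matching against the gold option is the simplest scoring rule for a multiple-choice setup; the paper may use a different answer-extraction scheme.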

@article{liu2025_2505.19015,
  title={Can Multimodal Large Language Models Understand Spatial Relations?},
  author={Jingping Liu and Ziyan Liu and Zhedong Cen and Yan Zhou and Yinan Zou and Weiyan Zhang and Haiyun Jiang and Tong Ruan},
  journal={arXiv preprint arXiv:2505.19015},
  year={2025}
}