Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?

17 May 2025

Abstract

The 180°×360° omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types, including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework that assesses both cognitive map generation and QA accuracy, using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models, under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: this https URL
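
The abstract names rotation-invariant matching but does not describe the procedure, so the following is only a minimal sketch of the general idea and not the paper's metric. It assumes a cognitive map is a square grid with one cell per object, that only 90-degree rotations are considered, and that the score is the fraction of ground-truth objects placed in the exact cell. The names `rotate90`, `match_score`, and `rotation_invariant_score` are hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col) on a square grid; assumed map format


def rotate90(cell: Cell, size: int) -> Cell:
    """Rotate a (row, col) cell 90 degrees clockwise on a size x size grid."""
    row, col = cell
    return (col, size - 1 - row)


def match_score(pred: Dict[str, Cell], truth: Dict[str, Cell]) -> float:
    """Fraction of ground-truth objects that the prediction places in the same cell."""
    if not truth:
        return 0.0
    hits = sum(1 for name, cell in truth.items() if pred.get(name) == cell)
    return hits / len(truth)


def rotation_invariant_score(
    pred: Dict[str, Cell], truth: Dict[str, Cell], size: int
) -> float:
    """Best match score over the four 90-degree rotations of the predicted map."""
    best = 0.0
    current = dict(pred)
    for _ in range(4):
        best = max(best, match_score(current, truth))
        # Rotate every predicted object by a further 90 degrees clockwise.
        current = {name: rotate90(cell, size) for name, cell in current.items()}
    return best


# Hypothetical 3x3 example: the prediction is the ground truth turned by 90 degrees.
truth_map = {"sofa": (0, 0), "tv": (0, 2)}
pred_map = {"sofa": (0, 2), "tv": (2, 2)}
example_plain = match_score(pred_map, truth_map)
example_invariant = rotation_invariant_score(pred_map, truth_map, size=3)
```

Taking the maximum over rotations means a model is not penalized for choosing a different reference heading, which is arbitrary in a panorama. A real metric would likely also tolerate small positional offsets, which this sketch does not.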

@article{dongfang2025_2505.11907,
  title={Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?},
  author={Zihao Dongfang and Xu Zheng and Ziqiao Weng and Yuanhuiyi Lyu and Danda Pani Paudel and Luc Van Gool and Kailun Yang and Xuming Hu},
  journal={arXiv preprint arXiv:2505.11907},
  year={2025}
}
