
Dense360: Dense Understanding from Omnidirectional Panoramas

Main: 11 pages, 7 figures, 4 tables; Bibliography: 4 pages
Abstract

Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world-understanding capabilities from limited field-of-view (FOV) inputs (e.g., 70°), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce a dataset of omnidirectional panoramas featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas provide more complete, compact, and continuous scene representations through equirectangular projection (ERP). However, ERP introduces two key challenges for MLLMs: i) spatial continuity along circles of latitude, and ii) latitude-dependent variation in information density. We address these challenges with ERP-RoPE, a position encoding scheme designed specifically for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
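
To make the two ERP properties above concrete, the following minimal Python sketch is illustrative only: the exact formulation of ERP-RoPE is not specified in this abstract, and the function names and frequency choices below are assumptions. It encodes the longitudinal axis with periodic rotary phases so the panorama's left and right edges meet seamlessly, and computes the cos(latitude) factor that governs per-row information density.

import numpy as np

# Illustrative sketch only (not the paper's ERP-RoPE): it demonstrates
# (i) spatial continuity along each circle of latitude, via rotary phases
#     that are periodic in longitude, and
# (ii) latitude-dependent information density, via the per-row solid-angle
#      factor cos(latitude).

def erp_lat_lon(height: int, width: int):
    """Map an ERP pixel grid to (latitude, longitude) in radians."""
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi   # +pi/2 (top) .. -pi/2 (bottom)
    lon = ((np.arange(width) + 0.5) / width) * 2.0 * np.pi     # [0, 2*pi), wraps around
    return lat, lon

def circular_rope_phases(lon: np.ndarray, dim: int = 64):
    """Rotary phases for the longitudinal axis using integer harmonics k = 1..dim/2.
    Because cos(k*lon) and sin(k*lon) are 2*pi-periodic, the encodings of the
    first and last ERP columns meet seamlessly, unlike a linear RoPE index."""
    k = np.arange(1, dim // 2 + 1)          # integer frequencies guarantee seam continuity
    return lon[:, None] * k                 # (W, dim/2) phase matrix

def row_density_weight(lat: np.ndarray):
    """Relative information density of each ERP row: pixels near the poles
    cover a smaller solid angle, proportional to cos(latitude)."""
    return np.cos(lat)

lat, lon = erp_lat_lon(height=512, width=1024)
phases = circular_rope_phases(lon)   # applied as cos/sin rotations to query/key pairs
weights = row_density_weight(lat)    # ~1.0 at the equator, ~0.0 at the poles
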

@article{zhou2025_2506.14471,
  title={Dense360: Dense Understanding from Omnidirectional Panoramas},
  author={Yikang Zhou and Tao Zhang and Dizhe Zhang and Shunping Ji and Xiangtai Li and Lu Qi},
  journal={arXiv preprint arXiv:2506.14471},
  year={2025}
}