Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

28 May 2025
Insu Lee
Wooje Park
Jaeyun Jang
Minyoung Noh
Kyuhong Shim
Byonghyo Shim
Abstract

Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where the first-person (egocentric) view captured by head-mounted cameras serves as a key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing LVLMs with complementary information such as global scene layout and object visibility. We present E3VQA, the first benchmark for multi-view question answering, with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs.
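
The abstract does not spell out how M3CoT assembles its prompt, so the sketch below is only an illustration of the general idea under stated assumptions: serialize scene graphs from the different perspectives into text, attach the synchronized ego and exo images, and query an LVLM such as GPT-4o through the OpenAI Python client. The abstract does not say what the three perspectives are; here they are taken to be ego, exo, and a merged graph purely as an assumption. All helper names (encode_image, build_m3cot_prompt, ask_multiview) and the prompt wording are hypothetical, not taken from the paper.

# Minimal, hypothetical sketch of scene-graph-augmented multi-view prompting.
# NOT the authors' M3CoT implementation; names and prompt wording are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline to the chat API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_m3cot_prompt(ego_graph: str, exo_graph: str, unified_graph: str, question: str) -> str:
    """Serialize the three (assumed) scene-graph views into one textual prompt."""
    return (
        "You are given synchronized first-person (ego) and third-person (exo) views of a scene.\n"
        f"Egocentric scene graph:\n{ego_graph}\n\n"
        f"Exocentric scene graph:\n{exo_graph}\n\n"
        f"Unified scene graph (merged across views):\n{unified_graph}\n\n"
        "Reason step by step over the graphs and both images, then answer the question.\n"
        f"Question: {question}"
    )


def ask_multiview(ego_image: str, exo_image: str, ego_graph: str, exo_graph: str,
                  unified_graph: str, question: str) -> str:
    """Send the scene-graph prompt plus both views to GPT-4o and return its answer."""
    prompt = build_m3cot_prompt(ego_graph, exo_graph, unified_graph, question)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(ego_image)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(exo_image)}"}},
            ],
        }],
    )
    return response.choices[0].message.content

In this sketch the scene graphs are supplied as pre-serialized strings; how the paper actually extracts and merges them is not described in the abstract.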

View on arXiv
@article{lee2025_2505.21955,
  title={Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs},
  author={Insu Lee and Wooje Park and Jaeyun Jang and Minyoung Noh and Kyuhong Shim and Byonghyo Shim},
  journal={arXiv preprint arXiv:2505.21955},
  year={2025}
}