An embodied AI assistant operating on egocentric video must integrate spatial cues across time: for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability in VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluate seven state-of-the-art VLMs and find that they lag behind human performance by 28%, with accuracy declining steeply (from 60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs yields only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs: constructing and maintaining 3D scene representations over time from visual signals alone. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.
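To make the evaluation setup concrete, the sketch below shows one way a benchmark item of this kind could be represented and scored, with accuracy bucketed by the temporal gap between the frames in which the two objects appear. This is a hypothetical illustration, not the authors' released code: the names DisjointQAItem, accuracy_by_gap, the exact-match scoring, and the bin size are all assumptions.

```python
# Hypothetical sketch of a Disjoint-3DQA-style item and gap-binned accuracy
# (illustrative only; field names and scoring are assumptions, not the paper's code).
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class DisjointQAItem:
    question: str   # e.g., "Where is the mug relative to the laptop?"
    answer: str     # ground-truth spatial relation, e.g., "left"
    frame_a: int    # frame index where object A is visible
    frame_b: int    # frame index where object B is visible (A and B never share a frame)

    @property
    def temporal_gap(self) -> int:
        # Number of frames separating the two sightings.
        return abs(self.frame_b - self.frame_a)

def accuracy_by_gap(items, predictions, bin_size: int = 50):
    """Group exact-match accuracy by temporal-gap bin (bin size is an assumption)."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        gap_bin = item.temporal_gap // bin_size
        total[gap_bin] += 1
        correct[gap_bin] += int(pred.strip().lower() == item.answer.strip().lower())
    return {b: correct[b] / total[b] for b in sorted(total)}
```

Plotting the output of such a per-bin accuracy table is one simple way to expose the reported decline from roughly 60% to 30% as the gap between sightings grows.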
@article{ravi2025_2505.24257,
  title   = {Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames},
  author  = {Sahithya Ravi and Gabriel Sarch and Vibhav Vineet and Andrew D. Wilson and Balasaravanan Thoravi Kumaravel},
  journal = {arXiv preprint arXiv:2505.24257},
  year    = {2025}
}