An embodied AI assistant operating on egocentric video must integrate spatial cues across time: for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability in VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluate seven state-of-the-art VLMs and find that they lag behind human performance by 28%, with accuracy declining steeply (from 60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs yields only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs: constructing and maintaining 3D scene representations over time from visual signals alone. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.
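To make the evaluation setup concrete, the sketch below shows one way a benchmark item of this kind could be represented and scored, with accuracy bucketed by the temporal gap between the frames in which the two objects appear. This is a hypothetical illustration, not the authors' released code: the names DisjointQAItem, accuracy_by_gap, the exact-match scoring, and the bin size are all assumptions.

```python
# Hypothetical sketch of a Disjoint-3DQA-style item and gap-binned accuracy
# (illustrative only; field names and scoring are assumptions, not the paper's code).
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class DisjointQAItem:
    question: str   # e.g., "Where is the mug relative to the laptop?"
    answer: str     # ground-truth spatial relation, e.g., "left"
    frame_a: int    # frame index where object A is visible
    frame_b: int    # frame index where object B is visible (A and B never share a frame)

    @property
    def temporal_gap(self) -> int:
        # Number of frames separating the two sightings.
        return abs(self.frame_b - self.frame_a)

def accuracy_by_gap(items, predictions, bin_size: int = 50):
    """Group exact-match accuracy by temporal-gap bin (bin size is an assumption)."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        gap_bin = item.temporal_gap // bin_size
        total[gap_bin] += 1
        correct[gap_bin] += int(pred.strip().lower() == item.answer.strip().lower())
    return {b: correct[b] / total[b] for b in sorted(total)}
```

Plotting the output of such a per-bin accuracy table is one simple way to expose the reported decline from roughly 60% to 30% as the gap between sightings grows.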
@article{ravi2025_2505.24257,
  title   = {Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames},
  author  = {Sahithya Ravi and Gabriel Sarch and Vibhav Vineet and Andrew D. Wilson and Balasaravanan Thoravi Kumaravel},
  journal = {arXiv preprint arXiv:2505.24257},
  year    = {2025}
}