Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Abstract

We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
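The 3D-JEPA pretext task described above can be illustrated with a toy sketch: points carry features lifted from 2D foundation models, a subset is masked, and the objective is to predict the *latent* representations of masked points from the visible context (rather than reconstructing raw inputs). This is a minimal illustration under assumed shapes and names (`W_ctx`, `W_tgt`, `W_pred` are hypothetical stand-ins), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N points, each with a D-dim feature vector, standing in for
# CLIP/DINO features lifted onto the sensor point cloud.
N, D = 64, 8
point_feats = rng.normal(size=(N, D))

# Toy "encoders" as random linear maps. JEPA-style, the target encoder is
# an EMA copy of the context encoder and is held fixed for the loss.
W_ctx = rng.normal(size=(D, D)) * 0.1   # context encoder (trainable)
W_tgt = W_ctx.copy()                    # EMA target encoder (frozen here)
W_pred = np.eye(D)                      # predictor head (identity init)

# Mask a random subset of points; the pretext task is to predict the
# latent features of the masked points from the visible context.
mask = rng.random(N) < 0.3
ctx_latent = point_feats[~mask] @ W_ctx   # encode visible points
tgt_latent = point_feats[mask] @ W_tgt    # target latents (masked points)

# Minimal predictor: map a pooled context latent to each masked target.
pred = np.tile(ctx_latent.mean(axis=0) @ W_pred, (int(mask.sum()), 1))

# Loss lives entirely in latent space -- no point/pixel reconstruction.
loss = float(((pred - tgt_latent) ** 2).mean())
```

In the actual model the encoders are transformers over the featurized point cloud and the target weights track the context weights via an exponential moving average; the sketch only shows the structure of the masked latent-prediction objective.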

@article{arnaud2025_2504.14151,
  title={Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D},
  author={Sergio Arnaud and Paul McVay and Ada Martin and Arjun Majumdar and Krishna Murthy Jatavallabhula and Phillip Thomas and Ruslan Partsey and Daniel Dugas and Abha Gejji and Alexander Sax and Vincent-Pierre Berges and Mikael Henaff and Ayush Jain and Ang Cao and Ishita Prasad and Mrinal Kalakrishnan and Michael Rabbat and Nicolas Ballas and Mido Assran and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
  journal={arXiv preprint arXiv:2504.14151},
  year={2025}
}