Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

20 May 2025
Yongshuo Zong
Qin Zhang
Dongsheng An
Zhihua Li
Xiang Xu
Linghan Xu
Zhuowen Tu
Yifan Xing
Onkar Dabeer
Main: 8 pages · Appendix: 11 pages · Bibliography: 3 pages · 14 figures · 17 tables
Abstract

This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly yields an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state of the art by more than 20%.
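
For readers unfamiliar with the reported metrics, the sketch below illustrates how gIoU and N-Acc are commonly computed for referring-segmentation benchmarks such as gRefCOCO: gIoU as the mean per-sample mask IoU (with no-target samples scored 1 for a correctly empty prediction and 0 otherwise) and N-Acc as the fraction of no-target samples for which the model predicts an empty mask. This follows the usual GRES-style convention rather than this paper's own evaluation code, and all function and variable names are illustrative.

import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def giou_and_nacc(preds, gts):
    """preds, gts: lists of binary numpy masks; an all-zero gt marks a no-target sample."""
    per_sample_ious = []
    no_target_hits, no_target_total = 0, 0
    for pred, gt in zip(preds, gts):
        if gt.sum() == 0:                      # no-target (e.g. hallucinated-reference) sample
            no_target_total += 1
            correct = pred.sum() == 0          # model should predict an empty mask
            no_target_hits += int(correct)
            per_sample_ious.append(1.0 if correct else 0.0)
        else:
            per_sample_ious.append(mask_iou(pred, gt))
    giou = float(np.mean(per_sample_ious))
    n_acc = no_target_hits / no_target_total if no_target_total else float("nan")
    return giou, n_acc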

@article{zong2025_2505.13788,
  title={Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels},
  author={Yongshuo Zong and Qin Zhang and Dongsheng An and Zhihua Li and Xiang Xu and Linghan Xu and Zhuowen Tu and Yifan Xing and Onkar Dabeer},
  journal={arXiv preprint arXiv:2505.13788},
  year={2025}
}