
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Abstract

Humans describe complex scenes compositionally, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labeled graph structure with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting compositions and relations among them. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models, by building a new dataset, GBC10M, that gathers GBC annotations for about 10M images of the CC12M dataset. Through CLIP training on GBC10M, we show that leveraging GBC nodes' annotations -- particularly those in composition and relation nodes -- significantly boosts the model's performance across various benchmarks compared to when other annotations are used. To further explore the opportunities provided by GBC, we also investigate the use of GBC as middleware for text-to-image generation, and show the extra benefits of incorporating the graph structure in this task. Our code and datasets are released at this https URL and this https URL.
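
The abstract describes GBC as a labeled graph whose nodes all carry plain-text captions and whose edges encode hierarchy. The short Python sketch below illustrates one possible in-memory representation of such an annotation. It is an illustration only, not the authors' released format: the root "image" node type and all class and field names are assumptions.

# Minimal illustrative sketch of a GBC-style annotation (hypothetical names).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GBCNode:
    node_id: str
    node_type: str       # e.g. "image", "entity", "composition", or "relation" (assumed labels)
    description: str     # plain-text caption held by this node
    children: List[str] = field(default_factory=list)  # ids of linked child nodes

@dataclass
class GBCGraph:
    nodes: Dict[str, GBCNode] = field(default_factory=dict)

    def add_node(self, node: GBCNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, parent_id: str, child_id: str) -> None:
        # Edges encode the hierarchy: a parent's caption is refined by its children.
        self.nodes[parent_id].children.append(child_id)

# Toy example: a root image node linked to two entity nodes and a relation node.
graph = GBCGraph()
graph.add_node(GBCNode("img", "image", "A dog chasing a ball in a park."))
graph.add_node(GBCNode("dog", "entity", "A brown dog in mid-run."))
graph.add_node(GBCNode("ball", "entity", "A red rubber ball."))
graph.add_node(GBCNode("rel", "relation", "The dog is chasing the ball."))
for child_id in ("dog", "ball", "rel"):
    graph.add_edge("img", child_id)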

@article{hsieh2025_2407.06723,
  title={Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions},
  author={Yu-Guan Hsieh and Cheng-Yu Hsieh and Shih-Ying Yeh and Louis Béthune and Hadi Pour Ansari and Pavan Kumar Anasosalu Vasu and Chun-Liang Li and Ranjay Krishna and Oncel Tuzel and Marco Cuturi},
  journal={arXiv preprint arXiv:2407.06723},
  year={2025}
}