Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that consolidates frame-level scene graphs from a video into an intermediate representation for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions. To achieve this, we leverage a lightweight graph-to-text model trained solely on text corpora, eliminating the need for video captioning annotations. Experiments on the MSR-VTT and ActivityNet Captions datasets show that our approach outperforms zero-shot video captioning baselines, demonstrating that aggregating frame-level scene graphs yields rich video understanding without requiring large-scale paired data or incurring high inference cost.
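The pipeline described above can be illustrated schematically. The sketch below is not the authors' implementation: the frame-captioning, caption-to-scene-graph, and graph-to-text stages are hypothetical placeholder functions with dummy outputs, and scene graphs are modeled as sets of (subject, relation, object) triples, a common convention that the paper may or may not follow exactly. Only the overall flow (frames to captions to graphs to a consolidated video-level caption) mirrors the abstract.

```python
from typing import Iterable, List, Set, Tuple

# A scene graph is modeled here as a set of (subject, relation, object) triples;
# this is a common convention, not necessarily the paper's exact representation.
Triple = Tuple[str, str, str]


def caption_frame(frame) -> str:
    """Placeholder for an off-the-shelf image VLM that captions a single frame."""
    return "a dog runs across a grassy park"  # dummy output for illustration


def caption_to_scene_graph(caption: str) -> Set[Triple]:
    """Placeholder for a caption-to-scene-graph parser."""
    return {("dog", "runs across", "park"), ("park", "is", "grassy")}  # dummy triples


def consolidate_graphs(graphs: Iterable[Set[Triple]]) -> Set[Triple]:
    """Merge frame-level scene graphs into one video-level graph.

    Shown here as a simple union of triples; the paper's consolidation
    strategy may be more sophisticated (e.g., resolving duplicate entities).
    """
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def graph_to_caption(graph: Set[Triple]) -> str:
    """Placeholder for the lightweight graph-to-text model trained only on text corpora."""
    return "; ".join(" ".join(triple) for triple in sorted(graph))  # dummy verbalization


def caption_video(frames: List) -> str:
    """End-to-end zero-shot pipeline: frames -> frame captions -> scene graphs -> video caption."""
    frame_captions = [caption_frame(f) for f in frames]
    frame_graphs = [caption_to_scene_graph(c) for c in frame_captions]
    video_graph = consolidate_graphs(frame_graphs)
    return graph_to_caption(video_graph)


if __name__ == "__main__":
    print(caption_video(frames=[None, None, None]))  # three dummy "frames"
```

Because the graph-to-text model consumes only consolidated graphs, it can be trained on text corpora alone, which is what removes the need for paired video-caption annotations in this zero-shot setting.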
@article{chu2025_2502.16427,
  title   = {Fine-Grained Video Captioning through Scene Graph Consolidation},
  author  = {Sanghyeok Chu and Seonguk Seo and Bohyung Han},
  journal = {arXiv preprint arXiv:2502.16427},
  year    = {2025}
}