CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

16 March 2025
Kanzhi Cheng
Wenpo Song
Jiaxin Fan
Zheng Ma
Qiushi Sun
Fangzhi Xu
Chenyang Yan
Nuo Chen
Jianbing Zhang
Jiajun Chen
    MLLM
    VLM
Abstract

Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at this https URL.
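The abstract does not say how pairwise votes are aggregated into a model ranking. As an illustration only, the sketch below applies a plain Elo update, a common choice in arena-style evaluations, to a handful of made-up battles. The function name `elo_ratings`, the model names, the `k` and `base` values, and the battle records are all hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict


def elo_ratings(battles, k=32.0, base=1000.0):
    """Aggregate pairwise battles into Elo-style ratings (illustrative only).

    battles: iterable of (competitor_a, competitor_b, winner), where winner
    is "a", "b", or "tie". Every competitor starts at `base`.
    """
    ratings = defaultdict(lambda: base)
    outcome_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for a, b, winner in battles:
        ra, rb = ratings[a], ratings[b]
        # Probability that `a` wins under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome_for_a[winner]
        # Zero-sum update: whatever `a` gains, `b` loses.
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return dict(ratings)


# Made-up votes; "human" stands for a human-written caption.
battles = [
    ("model_x", "human", "a"),
    ("model_x", "model_y", "a"),
    ("human", "model_y", "a"),
    ("model_y", "model_x", "tie"),
]
ratings = elo_ratings(battles)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Sequential Elo depends on the order of the battles; with thousands of votes, fitting a Bradley-Terry model over all battles at once is a frequently used order-independent alternative.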

@article{cheng2025_2503.12329,
  title={ CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era },
  author={ Kanzhi Cheng and Wenpo Song and Jiaxin Fan and Zheng Ma and Qiushi Sun and Fangzhi Xu and Chenyang Yan and Nuo Chen and Jianbing Zhang and Jiajun Chen },
  journal={arXiv preprint arXiv:2503.12329},
  year={ 2025 }
}