MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

29 June 2024
Jinsheng Huang
Liang Chen
Taian Guo
Fu Zeng
Yusheng Zhao
Bohan Wu
Ye Yuan
Haozhe Zhao
Zhihui Guo
Yichi Zhang
Jingyang Yuan
Wei Ju
Luchen Liu
Tianyu Liu
Baobao Chang
Ming Zhang
Abstract

Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.

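The triplet design implies an all-or-nothing scoring rule: presumably a model is credited for an item only when it answers the original question and both of its anchor questions (perception and knowledge) correctly, which is what makes lucky guesses and language-prior shortcuts ineffective. The sketch below illustrates one way such a triplet-level metric could be computed; the Triplet structure, field names, and example IDs are illustrative assumptions, not the benchmark's released code.

from dataclasses import dataclass
from typing import Dict, List

QUESTION_TYPES = ("origin", "perception", "knowledge")

@dataclass
class Triplet:
    """One hypothetical MMEvalPro item: the original MCQ plus its two anchor questions."""
    origin_id: str
    answers: Dict[str, str]      # gold answers keyed by question type
    predictions: Dict[str, str]  # model answers keyed by question type

def triplet_accuracy(triplets: List[Triplet]) -> float:
    """Fraction of triplets where the model answers all three questions correctly.

    A triplet counts as solved only if the original, perception, and knowledge
    questions are all answered correctly, so a model that guesses the original
    MCQ without grounding in the image gets no credit.
    """
    if not triplets:
        return 0.0
    solved = sum(
        all(t.predictions.get(q) == t.answers[q] for q in QUESTION_TYPES)
        for t in triplets
    )
    return solved / len(triplets)

# Toy usage with made-up item IDs and answers:
demo = [
    Triplet(
        origin_id="mmmu_0001",
        answers={"origin": "B", "perception": "A", "knowledge": "C"},
        predictions={"origin": "B", "perception": "A", "knowledge": "C"},
    ),
    Triplet(
        origin_id="scienceqa_0042",
        answers={"origin": "D", "perception": "B", "knowledge": "A"},
        predictions={"origin": "D", "perception": "C", "knowledge": "A"},
    ),
]
print(f"Triplet accuracy: {triplet_accuracy(demo):.2f}")  # 0.50: the second perception answer is wrong
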
View on arXiv
@article{huang2025_2407.00468,
  title={MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation},
  author={Jinsheng Huang and Liang Chen and Taian Guo and Fu Zeng and Yusheng Zhao and Bohan Wu and Ye Yuan and Haozhe Zhao and Zhihui Guo and Yichi Zhang and Jingyang Yuan and Wei Ju and Luchen Liu and Tianyu Liu and Baobao Chang and Ming Zhang},
  journal={arXiv preprint arXiv:2407.00468},
  year={2025}
}