SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
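For readers who want to probe a benchmark like SciVer on their own, the sketch below shows one way to prompt a multimodal model to verify a claim against a paper excerpt and an accompanying chart image. It is only an illustrative sketch: the field names (claim, context, image_path, label), the two-way label set, and the model name are assumptions made for the example, not SciVer's actual schema or the authors' evaluation pipeline.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Base64-encode a chart/table image so it can be sent inline."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def verify_claim(claim: str, context: str, image_path: str) -> str:
    """Ask a vision-language model whether the claim is supported by the evidence."""
    prompt = (
        "You are verifying a scientific claim against the paper excerpt and "
        "figure below. Answer with exactly one word: 'entailed' or 'refuted'.\n\n"
        f"Paper excerpt:\n{context}\n\nClaim:\n{claim}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any multimodal chat model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return "entailed" if "entail" in answer else "refuted"

# Hypothetical usage with one example record:
# verdict = verify_claim(example["claim"], example["context"], example["image_path"])
# correct = (verdict == example["label"])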
@article{wang2025_2506.15569,
  title={SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification},
  author={Chengye Wang and Yifei Shen and Zexi Kuang and Arman Cohan and Yilun Zhao},
  journal={arXiv preprint arXiv:2506.15569},
  year={2025}
}