
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou, Fengqi Cao, Kun Zhou, Rui Ge, Tingting Long, Yuexiang Zhu, Yan Liu, Jie Zheng, Junnian Wei, Rong Zhu, Peng Zou, Wenyu Li, Zekai Cheng, Tian Ding, Yaxuan Wang, Yizhao Yan, Tingru Wei, Haowei Ming, Weijie Mao, Chen Sun, Yiming Liu, Zichen Wang, Zuo Zhang, Tong Yang, Hao Ma, Zhen Gao, Jian Pei
Abstract

Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The benchmark dataset is available at this https URL.
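The abstract names Reasoning Path Fidelity (RPF) scoring but does not define it here. As a rough illustration only, the Python sketch below computes a hypothetical in-order step-matching score between a model's reasoning steps and an expert-authored solution path, using the standard-library difflib. The function name path_fidelity, the 0.8 similarity threshold, and the greedy matching rule are all assumptions made for illustration, not the authors' metric.

```python
from difflib import SequenceMatcher

def path_fidelity(expert_steps, model_steps, threshold=0.8):
    """Greedy in-order matching (illustrative, not the paper's RPF):
    fraction of expert steps covered by a sufficiently similar model
    step, with step order preserved."""
    matched, j = 0, 0
    for step in expert_steps:
        for k in range(j, len(model_steps)):
            sim = SequenceMatcher(None, step.lower(),
                                  model_steps[k].lower()).ratio()
            if sim >= threshold:  # this model step covers the expert step
                matched += 1
                j = k + 1         # later expert steps must match later model steps
                break
    return matched / len(expert_steps) if expert_steps else 0.0

# Toy example: the model skips the intermediate step, so it earns only
# partial credit (2 of 3 expert steps) regardless of its final answer.
expert = [
    "identify the electrophilic carbonyl carbon",
    "form the tetrahedral intermediate",
    "eliminate the leaving group",
]
model = [
    "identify the electrophilic carbonyl carbon first",
    "finally eliminate the leaving group",
]
print(f"fidelity = {path_fidelity(expert, model):.2f}")  # fidelity = 0.67
```

A process-level score of this shape rewards models that follow the expert's reasoning route and penalizes heuristic shortcuts that happen to land on the right answer, which is the distinction the abstract attributes to RPF.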

Main: 14 pages, 11 figures, 5 tables; Bibliography: 2 pages; Appendix: 19 pages