
XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation

Xue Liu
Xin Ma
Yuxin Ma
Yongchang Peng
Duo Wang
Zhoufutu Wen
Ge Zhang
Kaiyuan Zhang
Xinyu Chen
Tianci He
Jiani Hou
Liang Hu
Ziyun Huang
Yongzhe Hui
Jianpeng Jiao
Chennan Ju
Yingru Kong
Yiran Li
Mengyun Liu
Luyao Ma
Fei Ni
Yiqing Ni
Yueyan Qiu
Yanle Ren
Zilin Shi
Zaiyuan Wang
Wenjie Yue
Shiyu Zhang
Xinyi Zhang
Kaiwen Zhao
Zhenwei Zhu
Shanshan Wu
Qi Zhao
Wenhao Huang
Main: 11 pages · 3 figures · 12 tables · Bibliography: 3 pages · Appendix: 11 pages
Abstract

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with extensive clinical or industrial experience, ensuring superior ecological validity. Each task is scored against a detailed rubric, typically comprising 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
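The abstract describes two mechanisms only at a high level: rubric scoring over weighted checkpoints, and ShotJudge's LLM judging calibrated with expert few-shot exemplars. The Python sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `Checkpoint`, `rubric_score`, and `build_judge_prompt` are hypothetical names, and both the normalized weighted-sum aggregation and the judge-prompt template are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One weighted rubric item, judged pass/fail for a model response."""
    description: str
    weight: float
    passed: bool


def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of rubric checkpoints satisfied, in [0, 1].

    Assumption: the paper reports 15-40 weighted checkpoints per task
    but does not state the aggregation rule in the abstract, so a
    normalized weighted sum is used here.
    """
    total = sum(cp.weight for cp in checkpoints)
    earned = sum(cp.weight for cp in checkpoints if cp.passed)
    return earned / total if total > 0 else 0.0


def build_judge_prompt(task: str, response: str,
                       exemplars: list[tuple[str, str]]) -> str:
    """Assemble a judge prompt calibrated with expert-graded exemplars.

    Assumption: ShotJudge's actual prompt format is not given in the
    abstract; here, few-shot (response, expert grade) pairs simply
    precede the response under evaluation.
    """
    shots = "\n\n".join(
        f"Example response:\n{r}\nExpert grade:\n{g}" for r, g in exemplars
    )
    return (f"Task:\n{task}\n\n{shots}\n\n"
            f"Response to grade:\n{response}\nExpert grade:")


# Usage: a toy three-checkpoint rubric for one legal-services task.
rubric = [
    Checkpoint("Cites the controlling statute", 3.0, True),
    Checkpoint("Applies the correct limitation period", 2.0, False),
    Checkpoint("States the conclusion unambiguously", 1.0, True),
]
print(f"task score: {rubric_score(rubric):.2f}")  # -> task score: 0.67
```

In this reading, each checkpoint judgment would itself come from the few-shot-calibrated LLM judge, so the expert exemplars anchor the judge's grading scale before any weighted aggregation is applied.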
