
QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez
Alireza Amiri Bavandpour
Peter Ye
Edward Zhang
Ruslans Aleksejevs
Todor Antić
Polina Baron
Sujeet Bhalerao
Shubhrajit Bhattacharya
Zachary Burton
John Byrne
Hyungjun Choi
Nujhat Ahmed Disha
Koppany István Encz
Yuchen Fang
Robert Joseph George
Ebrahim Ghorbani
Alan Goldfarb
Jing Guo
Meghal Gupta
Stefano Huber
Annika Kanckos
Minjung Kang
Hyun Jong Kim
Dino Lorenzini
Levi Lorenzo
Tianyi Mao
Giovanni Marzenta
Ariane M. Masuda
Lukas Mauth
Ana Mickovic
Andres Miniguano-Trujillo
Antoine Moulin
Wenqi Ni
Tomos Parry
Kevin Ren
Hossein Roodbarani
Mathieu Rundström
Manjil Saikia
Detchat Samart
Rebecca Steiner
Connor Stewart
Dhara Thakkar
Jeffrey Tse
Vasiliki Velona
Yunhai Xiang
Sibel Yalçın
Jun Yan
Ji Zeng
Arman Cohan
Quanquan C. Liu
Main: 13 pages · 22 figures · 4 tables · Bibliography: 3 pages · Appendix: 107 pages
Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate and early-graduate-level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics with expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation, we reveal that frontier evaluators such as Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (mean score inflation of up to +0.18, +0.20, +0.30, and +0.36, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrade significantly in discrete domains, with average human evaluation scores dropping to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory, respectively. Beyond these findings, we release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly available at this https URL.
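
The "mean score inflation" figures above can be read as the average signed gap between an automated judge's score and the human expert's score on the same proofs. A minimal sketch of that reading in Python, using hypothetical per-problem scores rather than the paper's actual data or evaluation code:

    import numpy as np

    # Hypothetical per-problem scores in [0, 1] for one LLM judge and the
    # human expert graders on the same set of proofs (illustrative values only).
    judge_scores = np.array([0.9, 1.0, 0.8, 0.7, 1.0])
    human_scores = np.array([0.6, 0.9, 0.5, 0.7, 0.8])

    # Mean score inflation: the average signed difference between judge and human.
    # A positive value means the judge grades more leniently than the experts.
    inflation = float(np.mean(judge_scores - human_scores))
    print(f"mean score inflation: {inflation:+.2f}")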
