
RubricBench: Aligning Model-Generated Rubrics with Human Standards

Qiyuan Zhang
Junyi Zhou
Yufei Wang
Fuyuan Lyu
Yidong Ming
Can Xu
Qingfeng Sun
Kai Zheng
Peng Kang
Xue Liu
Chen Ma
Main: 2 Pages
4 Figures
15 Tables
Appendix: 21 Pages
Abstract

As Large Language Model (LLM) alignment evolves from simple completions to complex, open-ended generation, Reward Models increasingly rely on rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark for assessing this evaluation paradigm: existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark of 1,147 pairwise comparisons designed specifically to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline that targets hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from the instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics: even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
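The rubric-based pairwise evaluation the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration, not RubricBench's actual implementation: each response is scored against a list of atomic rubric items, and the response satisfying more items wins. The `satisfies` checker here is a toy keyword test standing in for what would, in practice, be an LLM-based rubric verifier.

```python
# Hypothetical sketch of rubric-guided pairwise comparison: score each
# response by how many atomic rubric items it satisfies, then compare.
# The rubric items and the `satisfies` predicate are illustrative
# assumptions, not taken from the paper.

def score(response: str, rubric: list[str], satisfies) -> int:
    """Count how many atomic rubric items the response satisfies."""
    return sum(1 for item in rubric if satisfies(response, item))

def pairwise_judge(resp_a: str, resp_b: str, rubric: list[str], satisfies) -> str:
    """Return 'A', 'B', or 'tie' based on rubric coverage."""
    a = score(resp_a, rubric, satisfies)
    b = score(resp_b, rubric, satisfies)
    return "A" if a > b else "B" if b > a else "tie"

# Toy checker: a rubric item is "satisfied" if its last word appears
# in the response (a stand-in for a real model-based verifier).
contains = lambda response, item: item.split()[-1] in response

rubric = ["states the complexity bound", "includes a worked example"]
verdict = pairwise_judge(
    "Here is a worked example with complexity bound O(n log n).",
    "It is fast.",
    rubric,
    contains,
)
print(verdict)  # → A
```

The point of the ground-truth rubric annotations is to make this comparison auditable: with atomic, instruction-derived items, a judge's verdict can be traced to specific satisfied or violated criteria rather than to surface preferences.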
