When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar
Anka Reuel
Prajna Soni
Sanchit Ahuja
Pawan Sasanka Ammanamanchi
Ruchit Rawal
Vilém Zouhar
Srishti Yadav
Chenxi Whitehouse
Dayeon Ki
Jennifer Mickel
Leshem Choshen
Marek Šuppa
Jan Batzner
Jenny Chim
Jeba Sania
Yanan Long
Hossein A. Rahmani
Christina Knight
Yiyang Nan
Jyoutir Raj
Yu Fan
Shubham Singh
Subramanyam Sahoo
Eliya Habba
Usman Gohar
Siddhesh Pawar
Robert Scholz
Arjun Subramonian
Jingwei Ni
Mykel Kochenderfer
Sanmi Koyejo
Mrinmaya Sachan
Stella Biderman
Zeerak Talat
Avijit Ghosh
Irene Solaiman
Main: 10 pages · 5 figures · 7 tables · Bibliography: 6 pages · Appendix: 5 pages
Abstract

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
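The abstract does not spell out the paper's exact saturation criterion, but the core idea (a benchmark is saturated once it can no longer differentiate the best-performing models) can be sketched with a simple heuristic. The function below, its thresholds, and its name are all hypothetical illustrations, not the authors' method:

```python
def is_saturated(top_scores, ceiling=100.0, spread_tol=2.0, ceiling_tol=5.0):
    """Illustrative saturation heuristic (thresholds are hypothetical).

    Flags a benchmark as saturated when the leading models' scores
    cluster within `spread_tol` points of each other and sit within
    `ceiling_tol` points of the maximum attainable score `ceiling`.
    """
    if not top_scores:
        return False
    spread = max(top_scores) - min(top_scores)   # can the benchmark still rank models?
    headroom = ceiling - max(top_scores)         # how much room is left to improve?
    return spread <= spread_tol and headroom <= ceiling_tol

# A benchmark where frontier models all score 93-95 out of 100 is flagged;
# one with a wide spread of scores is not.
print(is_saturated([94.1, 93.5, 95.0]))  # True
print(is_saturated([71.0, 64.2, 58.9]))  # False
```

Under such a rule, saturation rates would naturally rise as benchmarks age, since newer models keep compressing the score spread toward the ceiling.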
