When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar
Anka Reuel
Prajna Soni
Sanchit Ahuja
Pawan Sasanka Ammanamanchi
Ruchit Rawal
Vilém Zouhar
Srishti Yadav
Chenxi Whitehouse
Dayeon Ki
Jennifer Mickel
Leshem Choshen
Marek Šuppa
Jan Batzner
Jenny Chim
Jeba Sania
Yanan Long
Hossein A. Rahmani
Christina Knight
Yiyang Nan
Jyoutir Raj
Yu Fan
Shubham Singh
Subramanyam Sahoo
Eliya Habba
Usman Gohar
Siddhesh Pawar
Robert Scholz
Arjun Subramonian
Jingwei Ni
Mykel Kochenderfer
Sanmi Koyejo
Mrinmaya Sachan
Stella Biderman
Zeerak Talat
Avijit Ghosh
Irene Solaiman
Main: 10 pages · 5 figures · 7 tables · Bibliography: 6 pages · Appendix: 5 pages
Abstract

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
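The abstract does not spell out the paper's exact saturation criterion, but the core idea (a benchmark is saturated once it can no longer differentiate the best-performing models) can be sketched with a simple heuristic. The function below, its thresholds, and its name are all hypothetical illustrations, not the authors' method:

```python
def is_saturated(top_scores, ceiling=100.0, spread_tol=2.0, ceiling_tol=5.0):
    """Illustrative saturation heuristic (thresholds are hypothetical).

    Flags a benchmark as saturated when the leading models' scores
    cluster within `spread_tol` points of each other and sit within
    `ceiling_tol` points of the maximum attainable score `ceiling`.
    """
    if not top_scores:
        return False
    spread = max(top_scores) - min(top_scores)   # can the benchmark still rank models?
    headroom = ceiling - max(top_scores)         # how much room is left to improve?
    return spread <= spread_tol and headroom <= ceiling_tol

# A benchmark where frontier models all score 93-95 out of 100 is flagged;
# one with a wide spread of scores is not.
print(is_saturated([94.1, 93.5, 95.0]))  # True
print(is_saturated([71.0, 64.2, 58.9]))  # False
```

Under such a rule, saturation rates would naturally rise as benchmarks age, since newer models keep compressing the score spread toward the ceiling.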
