The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth & Uncertainty via a Probabilistic Paradigm

Aparna Elangovan
Lei Xu
Mahsa Elyasi
Ismail Akdulum
Mehmet Aksakal
Enes Gurun
Brian Hur
Saab Mansour
Ravid Shwartz Ziv
Karin Verspoor
Dan Roth
Main: 14 pages · 6 figures · 5 tables · Bibliography: 4 pages · Appendix: 5 pages
Abstract

Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores uncertainty in the expert-provided ground truth answers. This ambiguity is particularly consequential in medicine, where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to explain theoretically why high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas on datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Ignoring uncertainty in ground truth evaluation data can therefore lead to the misleading conclusion that a non-expert performs similarly to an expert. Building on the probabilistic paradigm, we introduce the concepts of expected accuracy and expected F1 to estimate the score a human expert or system can achieve given the variability of ground truth answers.
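The notion of expected accuracy described in the abstract can be sketched numerically. Assuming each item has a probability distribution over candidate ground-truth answers (the distributions, class counts, and function names below are illustrative placeholders, not the paper's actual formulation or data), the expected accuracy of a set of predictions is the mean probability mass placed on the predicted answers:

```python
import numpy as np

def expected_accuracy(preds, label_dists):
    """Mean probability that each prediction matches a label drawn
    from that item's ground-truth answer distribution."""
    return float(np.mean([label_dists[i, y] for i, y in enumerate(preds)]))

# Hypothetical 3-class answer distributions (one row per item, rows sum to 1).
low_variation  = np.array([[0.90, 0.05, 0.05],
                           [0.05, 0.90, 0.05]])
high_variation = np.array([[0.40, 0.30, 0.30],
                           [0.35, 0.35, 0.30]])

# An "expert" predicts each item's modal answer, so the achievable ceiling
# is the mean per-item max probability. A uniform random labeller scores 1/K.
expert_low  = expected_accuracy(np.argmax(low_variation, axis=1), low_variation)    # 0.9
expert_high = expected_accuracy(np.argmax(high_variation, axis=1), high_variation)  # 0.375
random_any  = 1.0 / 3.0  # uniform guessing over 3 candidate answers
```

With low ground-truth variation the expert's ceiling (0.9) sits far above chance (about 0.33); with high variation the ceiling (0.375) barely exceeds chance, illustrating the abstract's claim that a random labeller can appear competitive with an expert when answer uncertainty is ignored.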
