
From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered

Main: 10 pages · 1 figure · 1 table
Bibliography: 9 pages · Appendix: 11 pages
Abstract

Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community's progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.

@article{devic2025_2506.07461,
  title={From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered},
  author={Siddartha Devic and Tejas Srinivasan and Jesse Thomason and Willie Neiswanger and Vatsal Sharan},
  journal={arXiv preprint arXiv:2506.07461},
  year={2025}
}