ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic, method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and propose ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
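To make the resampling estimate concrete, below is a minimal sketch of one moment-based stopping rule: estimate the variance of the metric from a small pilot of paraphrased prompts, then solve for the sample size at which the standard error of the mean falls below a tolerance. The function name `estimate_num_resamples`, the tolerance `epsilon`, and the normal-approximation bound are illustrative assumptions, not the paper's exact estimator.

```python
import math
import statistics

def estimate_num_resamples(pilot_scores, epsilon=0.02, z=1.96):
    """Estimate how many prompt resamplings are needed so that a
    z-level confidence interval on the mean score has half-width
    at most epsilon, using moment estimates from a small pilot run.

    pilot_scores: metric values (e.g., accuracy) from a handful of
    meaning-preserving prompt paraphrases.

    NOTE: an illustrative sketch, not ReliableEval's exact procedure.
    """
    var_hat = statistics.variance(pilot_scores)  # sample second central moment
    # Solve z * sqrt(var / n) <= epsilon for n.
    n_needed = math.ceil((z ** 2) * var_hat / (epsilon ** 2))
    return max(len(pilot_scores), n_needed)

# Example: pilot accuracies over 8 paraphrases of the same task prompt.
pilot = [0.71, 0.64, 0.69, 0.58, 0.73, 0.66, 0.61, 0.70]
print(estimate_num_resamples(pilot))  # resamplings for +/-0.02 at ~95% confidence
```

The design choice here is that only the first two sample moments (mean and variance) of the pilot scores drive the budget, so the rule stays metric-agnostic: any scalar score over paraphrased prompts can be plugged in.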
@article{lior2025_2505.22169,
  title={ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments},
  author={Gili Lior and Eliya Habba and Shahar Levy and Avi Caciularu and Gabriel Stanovsky},
  journal={arXiv preprint arXiv:2505.22169},
  year={2025}
}