SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Main: 1 page, 5 tables, Appendix: 8 pages
Abstract

The LLM-as-a-Judge paradigm offers a scalable, reference-free approach to evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibration generalizes to real-world, open-ended tasks. In this work, we show that state-of-the-art calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and a public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.
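
To make the calibration idea concrete, below is a minimal sketch of one plausible reading of the abstract, not the paper's actual algorithm: it estimates a latent quality distribution over candidate models by maximizing entropy regularized toward a Bradley-Terry fit of a small set of human preference pairs, then reweights the raw judge scores by that distribution. The toy data, the Bradley-Terry coupling, and the lam trade-off parameter are all hypothetical (Python with NumPy/SciPy).

import numpy as np
from scipy.optimize import minimize

# Hypothetical raw LLM-judge scores for four candidate models.
judge_scores = np.array([7.2, 6.8, 8.1, 5.9])

# Hypothetical small set of human preference pairs: (winner_idx, loser_idx).
human_prefs = [(2, 0), (2, 1), (0, 3), (1, 3)]

def neg_objective(theta, lam=1.0):
    """Negative of: entropy of the latent quality distribution p(theta)
    plus lam times a Bradley-Terry log-likelihood of the human preferences.
    Maximizing this trades off a maximally uniform (high-entropy) latent
    distribution against agreement with the observed human pairs."""
    p = np.exp(theta - theta.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    loglik = sum(np.log(p[w] / (p[w] + p[l])) for w, l in human_prefs)
    return -(entropy + lam * loglik)

# Fit the latent quality distribution by numerical optimization.
res = minimize(neg_objective, x0=np.zeros(len(judge_scores)), method="L-BFGS-B")
p_hat = np.exp(res.x - res.x.max())
p_hat /= p_hat.sum()

# Reweight the raw judge scores by the estimated latent quality distribution.
calibrated = judge_scores * p_hat / p_hat.mean()
print("latent quality:", np.round(p_hat, 3))
print("calibrated scores:", np.round(calibrated, 2))

In this sketch the entropy term keeps the latent distribution from collapsing onto the few observed preferences, which is why only a small amount of human data is needed; the paper itself specifies the actual objective and constraints.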

@article{daynauth2025_2505.16003,
  title={SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models},
  author={Roland Daynauth and Christopher Clarke and Krisztian Flautner and Lingjia Tang and Jason Mars},
  journal={arXiv preprint arXiv:2505.16003},
  year={2025}
}