LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

26 June 2024
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
Albert Gatt
E. Ghaleb
Mario Giulianelli
Michael Hanna
Alexander Koller
André F. T. Martins
Philipp Mondorf
Vera Neplenbroek
Sandro Pezzelle
Barbara Plank
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; when the evaluations are conducted with proprietary models, it also raises concerns about reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits large variance across datasets in its correlation with human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
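To illustrate the kind of comparison the abstract describes, the minimal sketch below (not the JUDGE-BENCH code, and using invented example scores) computes two common agreement measures between an LLM judge and human annotators: Spearman rank correlation for graded ratings and Cohen's kappa for categorical labels, via scipy and scikit-learn.

# Minimal sketch: agreement between hypothetical LLM-judge outputs and human annotations.
# All scores and labels below are invented for illustration only.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical graded judgments (e.g., quality ratings on a 1-5 scale)
human_scores = [5, 4, 2, 3, 5, 1, 4, 2]
llm_scores   = [5, 3, 2, 4, 5, 2, 4, 1]

# Rank correlation between the LLM's scores and the human scores
rho, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

# Hypothetical categorical judgments (e.g., acceptable vs. unacceptable)
human_labels = ["yes", "no", "yes", "yes", "no", "no"]
llm_labels   = ["yes", "no", "no", "yes", "no", "yes"]

# Chance-corrected agreement for categorical labels
kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa = {kappa:.3f}")

In a benchmark like the one described, such per-dataset agreement scores would be computed for each model and dataset, which is what makes the reported variance across datasets visible.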

View on arXiv: 2406.18403