arXiv:2305.01633

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

2 May 2023
Anya Belz
Craig Thomson
Ehud Reiter
Gavin Abercrombie
J. Alonso-Moral
Mohammad Arvan
Jackie C.K. Cheung
Mark Cieliebak
Elizabeth Clark
Kees van Deemter
Tanvi Dinkar
Ondrej Dusek
Steffen Eger
Qixiang Fang
Mingqi Gao
Albert Gatt
Dimitra Gkatzia
Javier González-Corbelle
Dirk Hovy
Manuela Hurlimann
Takumi Ito
John D. Kelleher
Filip Klubicka
Emiel Krahmer
Huiyuan Lai
Chris van der Lee
Yiru Li
Saad Mahamood
Margot Mieskes
Emiel van Miltenburg
Pablo Romero
Malvina Nissim
Natalie Parde
Ondřej Plátek
Verena Rieser
Jie Ruan
Joel R. Tetreault
Antonio Toral
Xiao-Yi Wan
Leo Wanner
Lewis J. Watson
Diyi Yang
Abstract

We report our efforts to identify a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more or less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction and (ii) enough obtainable information to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding, that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
