Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

2 May 2023
Anya Belz
Craig Thomson
Ehud Reiter
Gavin Abercrombie
J. Alonso-Moral
Mohammad Arvan
Jackie C.K. Cheung
Mark Cieliebak
Elizabeth Clark
Kees van Deemter
Tanvi Dinkar
Ondřej Dušek
Steffen Eger
Qixiang Fang
Mingqi Gao
Albert Gatt
Dimitra Gkatzia
Javier González-Corbelle
Dirk Hovy
Manuela Hürlimann
Takumi Ito
John D. Kelleher
Filip Klubicka
Emiel Krahmer
Huiyuan Lai
Chris van der Lee
Yiru Li
Saad Mahamood
Margot Mieskes
Emiel van Miltenburg
Pablo Romero
Malvina Nissim
Natalie Parde
Ondřej Plátek
Verena Rieser
Jie Ruan
Joel R. Tetreault
Antonio Toral
Xiao-Yi Wan
Leo Wanner
Lewis J. Watson
Diyi Yang

Papers citing "Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP"

19 / 19 papers shown
Title
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma
Dora Zhao
Xinran Zhao
Chenglei Si
Chenyang Yang
Ryan Louie
Ehud Reiter
Diyi Yang
Tongshuang Wu
ALM
50
0
0
24 Mar 2025
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
Tomas Horych
Christoph Mandl
Terry Ruas
André Greiner-Petter
Bela Gipp
Akiko Aizawa
Timo Spinde
96
4
0
17 Nov 2024
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
Dongryeol Lee
Yerin Hwang
Yongil Kim
Joonsuk Park
Kyomin Jung
ELM
72
5
0
28 Oct 2024
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang
Wei-Ye Zhao
Steffen Eger
76
4
0
24 Oct 2024
A Comprehensive Survey and Classification of Evaluation Criteria for Trustworthy Artificial Intelligence
Louise McCormack
Malika Bendechache
XAI
34
0
0
10 Oct 2024
Improving governance outcomes through AI documentation: Bridging theory and practice
Amy A. Winecoff
Miranda Bogen
25
2
0
13 Sep 2024
Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices
Patrícia Schmidtová
Saad Mahamood
Simone Balloccu
Ondřej Dušek
Albert Gatt
Dimitra Gkatzia
David M. Howcroft
Ondřej Plátek
Adarsa Sivaprasad
45
3
0
17 Aug 2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen
Hannes Groner
Sina Zarrieß
Steffen Eger
42
8
0
21 Jun 2024
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
Chenyang Lyu
Minghao Wu
Alham Fikri Aji
ELM
43
13
0
21 Feb 2024
Humans or LLMs as the Judge? A Study on Judgement Biases
Guiming Hardy Chen
Shunian Chen
Ziche Liu
Feng Jiang
Benyou Wang
82
93
0
16 Feb 2024
Verifiable evaluations of machine learning models using zkSNARKs
Tobin South
Alexander Camuto
Shrey Jain
Shayla Nguyen
Robert Mahari
Christian Paquin
Jason Morton
Alex Pentland
MLAU
ALM
37
11
0
05 Feb 2024
GPTEval: A Survey on Assessments of ChatGPT and GPT-4
Rui Mao
Guanyi Chen
Xulang Zhang
Frank Guerin
Erik Cambria
ELM
LM&MA
33
101
0
24 Aug 2023
With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
Ondřej Plátek
Mateusz Lango
Ondřej Dušek
32
3
0
12 Aug 2023
Evaluating AI systems under uncertain ground truth: a case study in dermatology
David Stutz
A. Cemgil
Abhijit Guha Roy
Tatiana Matejovicova
Melih Barsbey
...
Yossi Matias
Pushmeet Kohli
Yun-hui Liu
Arnaud Doucet
Alan Karthikesalingam
33
4
0
05 Jul 2023
Understanding Counterspeech for Online Harm Mitigation
Yi-Ling Chung
Gavin Abercrombie
Florence E. Enock
Jonathan Bright
Verena Rieser
25
16
0
01 Jul 2023
Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation
Ran Zhang
Jihed Ouni
Steffen Eger
24
6
0
22 Jun 2023
KoLA: Carefully Benchmarking World Knowledge of Large Language Models
Jifan Yu
Xiaozhi Wang
Shangqing Tu
S. Cao
Daniel Zhang-Li
...
Lei Hou
Zhiyuan Liu
Bin Xu
Jie Tang
Juanzi Li
ELM
ALM
38
66
0
15 Jun 2023
Investigating Reproducibility at Interspeech Conferences: A Longitudinal and Comparative Perspective
Mohammad Arvan
A. Seza Doğruöz
Natalie Parde
19
0
0
07 Jun 2023
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MA
ALM
58
99
0
19 Dec 2022