Title
SPHERE: An Evaluation Card for Human-AI Systems Qianou Ma Dora Zhao Xinran Zhao Chenglei Si Chenyang Yang Ryan Louie Ehud Reiter Diyi Yang Tongshuang Wu ALM 50 0 0 24 Mar 2025
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection Tomas Horych Christoph Mandl Terry Ruas André Greiner-Petter Bela Gipp Akiko Aizawa Timo Spinde 96 4 0 17 Nov 2024
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation Dongryeol Lee Yerin Hwang Yongil Kim Joonsuk Park Kyomin Jung ELM 72 5 0 28 Oct 2024
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs Ran Zhang Wei-Ye Zhao Steffen Eger 76 4 0 24 Oct 2024
A Comprehensive Survey and Classification of Evaluation Criteria for Trustworthy Artificial Intelligence Louise McCormack Malika Bendechache XAI 34 0 0 10 Oct 2024
Improving governance outcomes through AI documentation: Bridging theory and practice Amy A. Winecoff Miranda Bogen 25 2 0 13 Sep 2024
Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices Patrícia Schmidtová Saad Mahamood Simone Balloccu Ondřej Dušek Albert Gatt Dimitra Gkatzia David M. Howcroft Ondřej Plátek Adarsa Sivaprasad 45 3 0 17 Aug 2024
Evaluating Diversity in Automatic Poetry Generation Yanran Chen Hannes Groner Sina Zarrieß Steffen Eger 42 8 0 21 Jun 2024
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models Chenyang Lyu Minghao Wu Alham Fikri Aji ELM 43 13 0 21 Feb 2024
Humans or LLMs as the Judge? A Study on Judgement Biases Guiming Hardy Chen Shunian Chen Ziche Liu Feng Jiang Benyou Wang 82 93 0 16 Feb 2024
Verifiable evaluations of machine learning models using zkSNARKs Tobin South Alexander Camuto Shrey Jain Shayla Nguyen Robert Mahari Christian Paquin Jason Morton Alex Pentland MLAU ALM 37 11 0 05 Feb 2024
GPTEval: A Survey on Assessments of ChatGPT and GPT-4 Rui Mao Guanyi Chen Xulang Zhang Frank Guerin Erik Cambria ELM LM&MA 33 101 0 24 Aug 2023
With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector Ondřej Plátek Mateusz Lango Ondřej Dušek 32 3 0 12 Aug 2023
Evaluating AI systems under uncertain ground truth: a case study in dermatology David Stutz A. Cemgil Abhijit Guha Roy Tatiana Matejovicova Melih Barsbey ... Yossi Matias Pushmeet Kohli Yun-hui Liu Arnaud Doucet Alan Karthikesalingam 33 4 0 05 Jul 2023
Understanding Counterspeech for Online Harm Mitigation Yi-Ling Chung Gavin Abercrombie Florence E. Enock Jonathan Bright Verena Rieser 25 16 0 01 Jul 2023
Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation Ran Zhang Jihed Ouni Steffen Eger 24 6 0 22 Jun 2023
KoLA: Carefully Benchmarking World Knowledge of Large Language Models Jifan Yu Xiaozhi Wang Shangqing Tu S. Cao Daniel Zhang-Li ... Lei Hou Zhiyuan Liu Bin Xu Jie Tang Juanzi Li ELM ALM 38 66 0 15 Jun 2023
Investigating Reproducibility at Interspeech Conferences: A Longitudinal and Comparative Perspective Mohammad Arvan A. Seza Doğruöz Natalie Parde 19 0 0 07 Jun 2023
Evaluating Human-Language Model Interaction Mina Lee Megha Srivastava Amelia Hardy John Thickstun Esin Durmus ... Hancheng Cao Tony Lee Rishi Bommasani Michael S. Bernstein Percy Liang LM&MA ALM 58 99 0 19 Dec 2022