The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

14 September 2021

Papers citing "The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation"

Title
SPHERE: An Evaluation Card for Human-AI Systems Qianou Ma Dora Zhao Xinran Zhao Chenglei Si Chenyang Yang Ryan Louie Ehud Reiter Diyi Yang Tongshuang Wu ALM 50 0 0 24 Mar 2025
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation Zhaopeng Feng Jiayuan Su Jiamei Zheng Jiahan Ren Yan Zhang Jian Wu Hongwei Wang Zuozhu Liu ELM 203 0 0 21 Feb 2025
Investigating Non-Transitivity in LLM-as-a-Judge Yi Xu Laura Ruis Tim Rocktäschel Robert Kirk 38 0 0 19 Feb 2025
Economics of Sourcing Human Data Sebastin Santy Prasanta Bhattacharya Manoel Horta Ribeiro Kelsey Allen Sewoong Oh 69 0 0 11 Feb 2025
A Collection of Question Answering Datasets for Norwegian Vladislav Mikhailov Petter Mæhlum Victoria Ovedie Chruickshank Langø Erik Velldal Lilja Øvrelid RALM 41 4 0 19 Jan 2025
Natural Language Processing RELIES on Linguistics Juri Opitz Shira Wein Nathan Schneider AI4CE 52 7 0 09 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations Preetam Prabhu Srikar Dammu Hayoung Jung Anjali Singh Monojit Choudhury Tanushree Mitra 32 8 0 08 May 2024
Evaluating Optimal Reference Translations Vilém Zouhar Věra Kloudová Martin Popel Ondřej Bojar 29 2 0 28 Nov 2023
A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing Carlos Gómez-Rodríguez Paul Williams 29 65 0 12 Oct 2023
Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation David Heineman Yao Dou Wei-ping Xu 22 7 0 14 Aug 2023
GIO: Gradient Information Optimization for Training Dataset Selection Dante Everaert Christopher Potts 21 3 0 20 Jun 2023
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond Haw-Shiuan Chang Zonghai Yao Alolika Gon Hong-ye Yu Andrew McCallum 43 10 0 20 May 2023
Large language models effectively leverage document-level context for literary translation, but critical errors persist Marzena Karpinska Mohit Iyyer 31 81 0 06 Apr 2023
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation Mayu Otani Riku Togashi Yu Sawai Ryosuke Ishigami Yuta Nakashima Esa Rahtu J. Heikkilä Shin'ichi Satoh 33 62 0 04 Apr 2023
In BLOOM: Creativity and Affinity in Artificial Lyrics and Art Evan Crothers H. Viktor Nathalie Japkowicz 30 3 0 13 Jan 2023
MAUVE Scores for Generative Models: Theory and Practice Krishna Pillutla Lang Liu John Thickstun Sean Welleck Swabha Swayamdipta Rowan Zellers Sewoong Oh Yejin Choi Zaïd Harchaoui EGVM 31 21 0 30 Dec 2022
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation Cyril Chhun Pierre Colombo Chloé Clavel Fabian M. Suchanek 51 50 0 24 Aug 2022
RankGen: Improving Text Generation with Large Ranking Models Kalpesh Krishna Yapei Chang John Wieting Mohit Iyyer AIMat 16 68 0 19 May 2022
SNaC: Coherence Error Detection for Narrative Summarization Tanya Goyal Junyi Jessy Li Greg Durrett 24 27 0 19 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications Kaitlyn Zhou Su Lin Blodgett Adam Trischler Hal Daumé Kaheer Suleman Alexandra Olteanu ELM 94 26 0 13 May 2022
HydraSum: Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models Tanya Goyal Nazneen Rajani Wenhao Liu Wojciech Kryściński AI4CE 15 12 0 08 Oct 2021
MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers Krishna Pillutla Swabha Swayamdipta Rowan Zellers John Thickstun Sean Welleck Yejin Choi Zaïd Harchaoui 37 341 0 02 Feb 2021
With Little Power Comes Great Responsibility Dallas Card Peter Henderson Urvashi Khandelwal Robin Jia Kyle Mahowald Dan Jurafsky 225 115 0 13 Oct 2020