The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

14 September 2021

Papers citing "The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation"

26 / 26 papers shown

Title
SPHERE: An Evaluation Card for Human-AI Systems Qianou Ma Dora Zhao Xinran Zhao Chenglei Si Chenyang Yang Ryan Louie Ehud Reiter Diyi Yang Tongshuang Wu ALM 50 0 0 24 Mar 2025
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation Zhaopeng Feng Jiayuan Su Jiamei Zheng Jiahan Ren Yan Zhang Jian Wu Hongwei Wang Zuozhu Liu ELM 203 0 0 21 Feb 2025
Investigating Non-Transitivity in LLM-as-a-Judge Yi Xu Laura Ruis Tim Rocktaschel Robert Kirk 40 0 0 19 Feb 2025
Economics of Sourcing Human Data Sebastin Santy Prasanta Bhattacharya Manoel Horta Ribeiro Kelsey Allen Sewoong Oh 71 0 0 11 Feb 2025
A Collection of Question Answering Datasets for Norwegian Vladislav Mikhailov Petter Mæhlum Victoria Ovedie Chruickshank Langø Erik Velldal Lilja Øvrelid RALM 41 4 0 19 Jan 2025
MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation Yan Ma Yu Qiao Pengfei Liu 32 5 0 09 Jun 2024
Natural Language Processing RELIES on Linguistics Juri Opitz Shira Wein Nathan Schneider AI4CE 52 7 0 09 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations Preetam Prabhu Srikar Dammu Hayoung Jung Anjali Singh Monojit Choudhury Tanushree Mitra 32 8 0 08 May 2024
Evaluating Optimal Reference Translations Vilém Zouhar Vvera Kloudová Martin Popel Ondrej Bojar 29 2 0 28 Nov 2023
A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing Carlos Gómez-Rodríguez Paul Williams 29 65 0 12 Oct 2023
Learning Personalized Alignment for Evaluating Open-ended Text Generation Danqing Wang Kevin Kaichuang Yang Hanlin Zhu Xiaomeng Yang Andrew Cohen Lei Li Yuandong Tian ALM LM&MA 15 8 0 05 Oct 2023
Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation David Heineman Yao Dou Wei-ping Xu 26 7 0 14 Aug 2023
GIO: Gradient Information Optimization for Training Dataset Selection Dante Everaert Christopher Potts 21 3 0 20 Jun 2023
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond Haw-Shiuan Chang Zonghai Yao Alolika Gon Hong-ye Yu Andrew McCallum 43 10 0 20 May 2023
Large language models effectively leverage document-level context for literary translation, but critical errors persist Marzena Karpinska Mohit Iyyer 31 81 0 06 Apr 2023
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation Mayu Otani Riku Togashi Yu Sawai Ryosuke Ishigami Yuta Nakashima Esa Rahtu J. Heikkilä Shiníchi Satoh 33 62 0 04 Apr 2023
In BLOOM: Creativity and Affinity in Artificial Lyrics and Art Evan Crothers H. Viktor Nathalie Japkowicz 30 3 0 13 Jan 2023
MAUVE Scores for Generative Models: Theory and Practice Krishna Pillutla Lang Liu John Thickstun Sean Welleck Swabha Swayamdipta Rowan Zellers Sewoong Oh Yejin Choi Zaïd Harchaoui EGVM 35 21 0 30 Dec 2022
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation Cyril Chhun Pierre Colombo Chloé Clavel Fabian M. Suchanek 53 50 0 24 Aug 2022
Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian T. Shamardina Vladislav Mikhailov Daniil Chernianskii Alena Fenogenova Marat Saidov A. Valeeva Tatiana Shavrina I. Smurov E. Tutubalina Ekaterina Artemova DeLMO 16 30 0 03 Jun 2022
RankGen: Improving Text Generation with Large Ranking Models Kalpesh Krishna Yapei Chang John Wieting Mohit Iyyer AIMat 18 68 0 19 May 2022
SNaC: Coherence Error Detection for Narrative Summarization Tanya Goyal Junyi Jessy Li Greg Durrett 27 27 0 19 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications Kaitlyn Zhou Su Lin Blodgett Adam Trischler Hal Daumé Kaheer Suleman Alexandra Olteanu ELM 94 26 0 13 May 2022
HydraSum: Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models Tanya Goyal Nazneen Rajani Wenhao Liu Wojciech Kry'sciñski AI4CE 15 12 0 08 Oct 2021
MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers Krishna Pillutla Swabha Swayamdipta Rowan Zellers John Thickstun Sean Welleck Yejin Choi Zaïd Harchaoui 37 341 0 02 Feb 2021
With Little Power Comes Great Responsibility Dallas Card Peter Henderson Urvashi Khandelwal Robin Jia Kyle Mahowald Dan Jurafsky 227 115 0 13 Oct 2020