The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

14 September 2021
Marzena Karpinska
Nader Akoury
Mohit Iyyer

Papers citing "The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation"

23 / 23 papers shown
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma
Dora Zhao
Xinran Zhao
Chenglei Si
Chenyang Yang
Ryan Louie
Ehud Reiter
Diyi Yang
Tongshuang Wu
ALM
50
0
0
24 Mar 2025
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation
Zhaopeng Feng
Jiayuan Su
Jiamei Zheng
Jiahan Ren
Yan Zhang
Jian Wu
Hongwei Wang
Zuozhu Liu
ELM
203
0
0
21 Feb 2025
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu
Laura Ruis
Tim Rocktäschel
Robert Kirk
38
0
0
19 Feb 2025
Economics of Sourcing Human Data
Sebastin Santy
Prasanta Bhattacharya
Manoel Horta Ribeiro
Kelsey Allen
Sewoong Oh
69
0
0
11 Feb 2025
A Collection of Question Answering Datasets for Norwegian
Vladislav Mikhailov
Petter Mæhlum
Victoria Ovedie Chruickshank Langø
Erik Velldal
Lilja Øvrelid
RALM
41
4
0
19 Jan 2025
Natural Language Processing RELIES on Linguistics
Juri Opitz
Shira Wein
Nathan Schneider
AI4CE
52
7
0
09 May 2024
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Preetam Prabhu Srikar Dammu
Hayoung Jung
Anjali Singh
Monojit Choudhury
Tanushree Mitra
32
8
0
08 May 2024
Evaluating Optimal Reference Translations
Vilém Zouhar
Věra Kloudová
Martin Popel
Ondřej Bojar
29
2
0
28 Nov 2023
A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing
Carlos Gómez-Rodríguez
Paul Williams
29
65
0
12 Oct 2023
Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation
David Heineman
Yao Dou
Wei-ping Xu
22
7
0
14 Aug 2023
GIO: Gradient Information Optimization for Training Dataset Selection
Dante Everaert
Christopher Potts
21
3
0
20 Jun 2023
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond
Haw-Shiuan Chang
Zonghai Yao
Alolika Gon
Hong-ye Yu
Andrew McCallum
43
10
0
20 May 2023
Large language models effectively leverage document-level context for literary translation, but critical errors persist
Marzena Karpinska
Mohit Iyyer
31
81
0
06 Apr 2023
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
Mayu Otani
Riku Togashi
Yu Sawai
Ryosuke Ishigami
Yuta Nakashima
Esa Rahtu
J. Heikkilä
Shin'ichi Satoh
33
62
0
04 Apr 2023
In BLOOM: Creativity and Affinity in Artificial Lyrics and Art
Evan Crothers
H. Viktor
Nathalie Japkowicz
30
3
0
13 Jan 2023
MAUVE Scores for Generative Models: Theory and Practice
Krishna Pillutla
Lang Liu
John Thickstun
Sean Welleck
Swabha Swayamdipta
Rowan Zellers
Sewoong Oh
Yejin Choi
Zaïd Harchaoui
EGVM
31
21
0
30 Dec 2022
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Cyril Chhun
Pierre Colombo
Chloé Clavel
Fabian M. Suchanek
51
50
0
24 Aug 2022
RankGen: Improving Text Generation with Large Ranking Models
Kalpesh Krishna
Yapei Chang
John Wieting
Mohit Iyyer
AIMat
16
68
0
19 May 2022
SNaC: Coherence Error Detection for Narrative Summarization
Tanya Goyal
Junyi Jessy Li
Greg Durrett
24
27
0
19 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
94
26
0
13 May 2022
HydraSum: Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models
Tanya Goyal
Nazneen Rajani
Wenhao Liu
Wojciech Kryściński
AI4CE
15
12
0
08 Oct 2021
MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers
Krishna Pillutla
Swabha Swayamdipta
Rowan Zellers
John Thickstun
Sean Welleck
Yejin Choi
Zaïd Harchaoui
37
341
0
02 Feb 2021
With Little Power Comes Great Responsibility
Dallas Card
Peter Henderson
Urvashi Khandelwal
Robin Jia
Kyle Mahowald
Dan Jurafsky
225
115
0
13 Oct 2020