Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation
Practices for Generated Text

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

14 February 2022

Sebastian Gehrmann

Elizabeth Clark

Thibault Sellam

Papers citing "Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text"

18 / 68 papers shown

Title
A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication A. Luccioni Frances Corry H. Sridharan Mike Ananny J. Schultz Kate Crawford 54 29 0 18 Oct 2021
Finding a Balanced Degree of Automation for Summary Evaluation Shiyue Zhang Joey Tianyi Zhou 52 43 0 23 Sep 2021
The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation Marzena Karpinska Nader Akoury Mohit Iyyer 220 106 0 14 Sep 2021
Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP Anna Rogers Timothy Baldwin Kobi Leins 104 64 0 14 Sep 2021
Perturbation CheckLists for Evaluating NLG Evaluation Metrics Ananya B. Sai Tanay Dixit D. Y. Sheth S. Mohan Mitesh M. Khapra AAML 113 57 0 13 Sep 2021
Does It Capture STEL? A Modular, Similarity-based Linguistic Style Evaluation Framework Anna Wegmann D. Nguyen 24 12 0 10 Sep 2021
Deduplicating Training Data Makes Language Models Better Katherine Lee Daphne Ippolito A. Nystrom Chiyuan Zhang Douglas Eck Chris Callison-Burch Nicholas Carlini SyDa 242 593 0 14 Jul 2021
Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark Nouha Dziri Hannah Rashkin Tal Linzen David Reitter ALM 192 79 0 30 Apr 2021
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics Artidoro Pagnoni Vidhisha Balachandran Yulia Tsvetkov HILM 231 305 0 27 Apr 2021
Beyond Fair Pay: Ethical Implications of NLP Crowdsourcing Boaz Shmueli Jan Fell Soumya Ray Lun-Wei Ku 108 86 0 20 Apr 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics Sebastian Gehrmann Tosin P. Adewumi Karmanya Aggarwal Pawan Sasanka Ammanamanchi Aremu Anuoluwapo ... Nishant Subramani Wei-ping Xu Diyi Yang Akhila Yerukola Jiawei Zhou VLM 254 285 0 02 Feb 2021
Robustness Gym: Unifying the NLP Evaluation Landscape Karan Goel Nazneen Rajani Jesse Vig Samson Tan Jason M. Wu Stephan Zheng Caiming Xiong Joey Tianyi Zhou Christopher Ré AAML OffRL OOD 154 136 0 13 Jan 2021
GO FIGURE: A Meta Evaluation of Factuality in Summarization Saadia Gabriel Asli Celikyilmaz Rahul Jha Yejin Choi Jianfeng Gao HILM 238 96 0 24 Oct 2020
Factual Error Correction for Abstractive Summarization Models Mengyao Cao Yue Dong Jiapeng Wu Jackie C.K. Cheung HILM KELM 169 159 0 17 Oct 2020
With Little Power Comes Great Responsibility Dallas Card Peter Henderson Urvashi Khandelwal Robin Jia Kyle Mahowald Dan Jurafsky 230 115 0 13 Oct 2020
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 254 3,684 0 28 Feb 2017
SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents Ramesh Nallapati Feifei Zhai Bowen Zhou 207 1,255 0 14 Nov 2016
Teaching Machines to Read and Comprehend Karl Moritz Hermann Tomás Kociský Edward Grefenstette L. Espeholt W. Kay Mustafa Suleyman Phil Blunsom 175 3,510 0 10 Jun 2015