RankME: Reliable Human Ratings for Natural Language Generation
Jekaterina Novikova, Ondrej Dusek, Verena Rieser · ALM · 15 March 2018
arXiv:1803.05928
Papers citing "RankME: Reliable Human Ratings for Natural Language Generation" (34 of 34 papers shown)
The Viability of Crowdsourcing for RAG Evaluation
Lukas Gienapp, Tim Hagen, Maik Frobe, Matthias Hagen, Benno Stein, Martin Potthast, Harrisen Scells · 22 Apr 2025

TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation
Jonathan Cook, Tim Rocktaschel, Jakob Foerster, Dennis Aumiller, Alex Wang · ALM · 04 Oct 2024

DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu · LM&MA, ELM · 25 Aug 2024

AI-Assisted Human Evaluation of Machine Translation
Vilém Zouhar, Tom Kocmi, Mrinmaya Sachan · 18 Jun 2024

Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems
Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke · 15 Apr 2024

How Much Annotation is Needed to Compare Summarization Models?
Chantal Shaib, Joe Barrow, Alexa F. Siu, Byron C. Wallace, A. Nenkova · 28 Feb 2024

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stéphan Clémençon, Pierre Colombo · 17 May 2023

LENS: A Learnable Evaluation Metric for Text Simplification
Mounica Maddela, Yao Dou, David Heineman, Wei-ping Xu · 19 Dec 2022

NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?
Saadia Gabriel, Hamid Palangi, Yejin Choi · AAML · 08 Nov 2022

On the Effectiveness of Automated Metrics for Text Generation Systems
Pius von Daniken, Jan Deriu, Don Tuggener, Mark Cieliebak · 24 Oct 2022

Risk-graded Safety for Handling Medical Queries in Conversational AI
Gavin Abercrombie, Verena Rieser · AI4MH · 02 Oct 2022

The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo, Maxime Peyrard, Nathan Noiry, Robert West, Pablo Piantanida · 31 Aug 2022

Innovations in Neural Data-to-text Generation: A Survey
Mandar Sharma, Ajay K. Gogineni, Naren Ramakrishnan · 25 Jul 2022

The Authenticity Gap in Human Evaluation
Kawin Ethayarajh, Dan Jurafsky · 24 May 2022

Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri, Jinho Choi, L. F. D'Haro, Jan Deriu, M. Eskénazi, ..., David Traum, Yi-Ting Yeh, Zhou Yu, Yizhe Zhang, Chen Zhang · 18 Mar 2022

Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
Tianbo Ji, Yvette Graham, Gareth J. F. Jones, Chenyang Lyu, Qun Liu · ALM · 11 Mar 2022

Czech Grammar Error Correction with a Large and Diverse Corpus
Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen · 14 Jan 2022

A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, Dawei Song · 14 Jan 2022

Dynamic Human Evaluation for Relative Model Comparisons
Thórhildur Thorleiksdóttir, Cédric Renggli, Nora Hollenstein, Ce Zhang · 15 Dec 2021

Better than Average: Paired Evaluation of NLP Systems
Maxime Peyrard, Wei-Ye Zhao, Steffen Eger, Robert West · ELM · 20 Oct 2021

AutoChart: A Dataset for Chart-to-Text Generation Task
Jiawen Zhu, Jinye Ran, Roy Ka-Wei Lee, Kenny Choo, Zhi Li · 16 Aug 2021

Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
Emily Dinan, Gavin Abercrombie, A. S. Bergman, Shannon L. Spruit, Dirk Hovy, Y-Lan Boureau, Verena Rieser · 07 Jul 2021

Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text
Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi · DeLMO · 02 Jul 2021

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith · DeLMO · 30 Jun 2021

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson, Ehud Reiter · 08 Nov 2020

An Evaluation Protocol for Generative Conversational Systems
Seolhwa Lee, Heuiseok Lim, João Sedoc · ELM · 24 Oct 2020

Local Knowledge Powered Conversational Agents
Sashank Santhanam, Ming-Yu Liu, Raul Puri, M. Shoeybi, M. Patwary, Bryan Catanzaro · 20 Oct 2020

Evaluation of Text Generation: A Survey
Asli Celikyilmaz, Elizabeth Clark, Jianfeng Gao · ELM, LM&MA · 26 Jun 2020

Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions
Stephen Roller, Y-Lan Boureau, Jason Weston, Antoine Bordes, Emily Dinan, ..., Kurt Shuster, Eric Michael Smith, Arthur Szlam, Jack Urbanek, Mary Williamson · LLMAG, AI4CE · 22 Jun 2020

A Crowd-based Evaluation of Abuse Response Strategies in Conversational Agents
Amanda Cercas Curry, Verena Rieser · 10 Sep 2019

ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
Margaret Li, Jason Weston, Stephen Roller · 06 Sep 2019

Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge
Ondrej Dusek, Jekaterina Novikova, Verena Rieser · ELM · 23 Jan 2019

Findings of the E2E NLG Challenge
Ondrej Dusek, Jekaterina Novikova, Verena Rieser · 02 Oct 2018

Efficient Online Scalar Annotation with Bounded Support
Keisuke Sakaguchi, Benjamin Van Durme · 04 Jun 2018