RankME: Reliable Human Ratings for Natural Language Generation
Jekaterina Novikova, Ondrej Dusek, Verena Rieser · ALM · 15 March 2018
arXiv:1803.05928
Papers citing "RankME: Reliable Human Ratings for Natural Language Generation" (34 of 34 papers shown)
The Viability of Crowdsourcing for RAG Evaluation
Lukas Gienapp, Tim Hagen, Maik Frobe, Matthias Hagen, Benno Stein, Martin Potthast, Harrisen Scells · 22 Apr 2025

TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation
Jonathan Cook, Tim Rocktaschel, Jakob Foerster, Dennis Aumiller, Alex Wang · ALM · 04 Oct 2024

DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu · LM&MA, ELM · 25 Aug 2024

AI-Assisted Human Evaluation of Machine Translation
Vilém Zouhar, Tom Kocmi, Mrinmaya Sachan · 18 Jun 2024

Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems
Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke · 15 Apr 2024

How Much Annotation is Needed to Compare Summarization Models?
Chantal Shaib, Joe Barrow, Alexa F. Siu, Byron C. Wallace, A. Nenkova · 28 Feb 2024

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stéphan Clémençon, Pierre Colombo · 17 May 2023

LENS: A Learnable Evaluation Metric for Text Simplification
Mounica Maddela, Yao Dou, David Heineman, Wei-ping Xu · 19 Dec 2022

NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?
Saadia Gabriel, Hamid Palangi, Yejin Choi · AAML · 08 Nov 2022

On the Effectiveness of Automated Metrics for Text Generation Systems
Pius von Daniken, Jan Deriu, Don Tuggener, Mark Cieliebak · 24 Oct 2022

Risk-graded Safety for Handling Medical Queries in Conversational AI
Gavin Abercrombie, Verena Rieser · AI4MH · 02 Oct 2022

The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo, Maxime Peyrard, Nathan Noiry, Robert West, Pablo Piantanida · 31 Aug 2022

Innovations in Neural Data-to-text Generation: A Survey
Mandar Sharma, Ajay K. Gogineni, Naren Ramakrishnan · 25 Jul 2022

The Authenticity Gap in Human Evaluation
Kawin Ethayarajh, Dan Jurafsky · 24 May 2022

Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri, Jinho Choi, L. F. D'Haro, Jan Deriu, M. Eskénazi, ..., David Traum, Yi-Ting Yeh, Zhou Yu, Yizhe Zhang, Chen Zhang · 18 Mar 2022

Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
Tianbo Ji, Yvette Graham, Gareth J. F. Jones, Chenyang Lyu, Qun Liu · ALM · 11 Mar 2022

Czech Grammar Error Correction with a Large and Diverse Corpus
Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen · 14 Jan 2022

A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, Dawei Song · 14 Jan 2022

Dynamic Human Evaluation for Relative Model Comparisons
Thórhildur Thorleiksdóttir, Cédric Renggli, Nora Hollenstein, Ce Zhang · 15 Dec 2021

Better than Average: Paired Evaluation of NLP Systems
Maxime Peyrard, Wei-Ye Zhao, Steffen Eger, Robert West · ELM · 20 Oct 2021

AutoChart: A Dataset for Chart-to-Text Generation Task
Jiawen Zhu, Jinye Ran, Roy Ka-Wei Lee, Kenny Choo, Zhi Li · 16 Aug 2021

Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
Emily Dinan, Gavin Abercrombie, A. S. Bergman, Shannon L. Spruit, Dirk Hovy, Y-Lan Boureau, Verena Rieser · 07 Jul 2021

Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text
Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi · DeLMO · 02 Jul 2021

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith · DeLMO · 30 Jun 2021

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
Craig Thomson, Ehud Reiter · 08 Nov 2020

An Evaluation Protocol for Generative Conversational Systems
Seolhwa Lee, Heuiseok Lim, João Sedoc · ELM · 24 Oct 2020

Local Knowledge Powered Conversational Agents
Sashank Santhanam, Ming-Yu Liu, Raul Puri, M. Shoeybi, M. Patwary, Bryan Catanzaro · 20 Oct 2020

Evaluation of Text Generation: A Survey
Asli Celikyilmaz, Elizabeth Clark, Jianfeng Gao · ELM, LM&MA · 26 Jun 2020

Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions
Stephen Roller, Y-Lan Boureau, Jason Weston, Antoine Bordes, Emily Dinan, ..., Kurt Shuster, Eric Michael Smith, Arthur Szlam, Jack Urbanek, Mary Williamson · LLMAG, AI4CE · 22 Jun 2020

A Crowd-based Evaluation of Abuse Response Strategies in Conversational Agents
Amanda Cercas Curry, Verena Rieser · 10 Sep 2019

ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
Margaret Li, Jason Weston, Stephen Roller · 06 Sep 2019

Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge
Ondrej Dusek, Jekaterina Novikova, Verena Rieser · ELM · 23 Jan 2019

Findings of the E2E NLG Challenge
Ondrej Dusek, Jekaterina Novikova, Verena Rieser · 02 Oct 2018

Efficient Online Scalar Annotation with Bounded Support
Keisuke Sakaguchi, Benjamin Van Durme · 04 Jun 2018