The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

2 February 2021
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
Antoine Bosselut
Khyathi Raghavi Chandu
Miruna Clinciu
Dipanjan Das
Kaustubh D. Dhole
Wanyu Du
Esin Durmus
Ondrej Dusek
Chris C. Emezue
Varun Gangal
Cristina Garbacea
Tatsunori Hashimoto
Yufang Hou
Yacine Jernite
Harsh Jhamtani
Yangfeng Ji
Shailza Jolly
Mihir Kale
Dhruv Kumar
Faisal Ladhak
Aman Madaan
Mounica Maddela
Khyati Mahajan
Saad Mahamood
Bodhisattwa Prasad Majumder
Pedro Henrique Martins
Angelina McMillan-Major
Simon Mille
Emiel van Miltenburg
Moin Nadeem
Shashi Narayan
Vitaly Nikolaev
Andre Niyongabo Rubungo
Salomey Osei
Ankur P. Parikh
Laura Perez-Beltrachini
Niranjan Rao
Vikas Raunak
Juan Diego Rodriguez
Sashank Santhanam
João Sedoc
Thibault Sellam
Samira Shaikh
Anastasia Shimorina
Marco Antonio Sobrevilla Cabezudo
Hendrik Strobelt
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
    VLM

Papers citing "The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics"

50 / 100 papers shown
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
99
26
0
13 May 2022
UL2: Unifying Language Learning Paradigms
Yi Tay
Mostafa Dehghani
Vinh Q. Tran
Xavier Garcia
Jason W. Wei
...
Tal Schuster
H. Zheng
Denny Zhou
N. Houlsby
Donald Metzler
AI4CE
57
296
0
10 May 2022
Vector Representations of Idioms in Conversational Systems
Tosin P. Adewumi
F. Liwicki
Marcus Liwicki
35
8
0
07 May 2022
When a sentence does not introduce a discourse entity, Transformer-based models still sometimes refer to it
Sebastian Schuster
Tal Linzen
13
25
0
06 May 2022
State-of-the-art in Open-domain Conversational AI: A Survey
Tosin P. Adewumi
F. Liwicki
Marcus Liwicki
29
15
0
02 May 2022
mGPT: Few-Shot Learners Go Multilingual
Oleh Shliazhko
Alena Fenogenova
Maria Tikhonova
Vladislav Mikhailov
Anastasia Kozlova
Tatiana Shavrina
49
149
0
15 Apr 2022
Task2Dial: A Novel Task and Dataset for Commonsense enhanced Task-based Dialogue Grounded in Documents
Carl Strathearn
Dimitra Gkatzia
30
8
0
03 Apr 2022
Hyperdecoders: Instance-specific decoders for multi-task NLP
Hamish Ivison
Matthew E. Peters
AI4CE
26
20
0
15 Mar 2022
Diversifying Content Generation for Commonsense Reasoning with Mixture of Knowledge Graph Experts
W. Yu
Chenguang Zhu
Lianhui Qin
Zhihan Zhang
Tong Zhao
Meng Jiang
LRM
28
31
0
14 Mar 2022
IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Aman Kumar
Himani Shrotriya
P. Sahu
Raj Dabre
Ratish Puduppully
Anoop Kunchukuttan
Amogh Mishra
Mitesh M. Khapra
Pratyush Kumar
43
38
0
10 Mar 2022
Assessing the State of Self-Supervised Human Activity Recognition using Wearables
H. Haresamudram
Irfan Essa
Thomas Plötz
SSL
42
86
0
22 Feb 2022
Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation
Jannis Bulian
Christian Buck
Wojciech Gajewski
Benjamin Boerschinger
Tal Schuster
26
43
0
15 Feb 2022
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
Hanqing Zhang
Haolin Song
Shaoyu Li
Ming Zhou
Dawei Song
52
214
0
14 Jan 2022
Measuring Attribution in Natural Language Generation Models
Hannah Rashkin
Vitaly Nikolaev
Matthew Lamm
Lora Aroyo
Michael Collins
Dipanjan Das
Slav Petrov
Gaurav Singh Tomar
Iulia Turc
David Reitter
39
173
0
23 Dec 2021
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh D. Dhole
Varun Gangal
Sebastian Gehrmann
Aadesh Gupta
Zhenhao Li
...
Tianbao Xie
Usama Yaseen
Michael A. Yee
Jing Zhang
Yue Zhang
174
86
0
06 Dec 2021
LMdiff: A Visual Diff Tool to Compare Language Models
Hendrik Strobelt
Benjamin Hoover
Arvind Satyanarayan
Sebastian Gehrmann
VLM
34
19
0
02 Nov 2021
BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation
Thomas Scialom
Felix Hill
28
7
0
18 Oct 2021
Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
Xiangyang Liu
Tianxiang Sun
Junliang He
Jiawen Wu
Lingling Wu
Xinyu Zhang
Hao Jiang
Bo Zhao
Xuanjing Huang
Xipeng Qiu
ELM
28
46
0
13 Oct 2021
Truth-Conditional Captioning of Time Series Data
Harsh Jhamtani
Taylor Berg-Kirkpatrick
AI4TS
38
7
0
05 Oct 2021
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Mingkai Deng
Bowen Tan
Zhengzhong Liu
Eric P. Xing
Zhiting Hu
16
72
0
14 Sep 2021
The Grammar-Learning Trajectories of Neural Language Models
Leshem Choshen
Guy Hacohen
D. Weinshall
Omri Abend
29
28
0
13 Sep 2021
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
116
57
0
13 Sep 2021
Towards Natural Language Interfaces for Data Visualization: A Survey
Leixian Shen
Enya Shen
Yuyu Luo
Xiaocong Yang
Xuming Hu
Xiongshuai Zhang
Zhiwei Tai
Jianmin Wang
29
137
0
08 Sep 2021
Datasets: A Community Library for Natural Language Processing
Quentin Lhoest
Albert Villanova del Moral
Yacine Jernite
A. Thakur
Patrick von Platen
...
Thibault Goehringer
Victor Mustar
François Lagunas
Alexander M. Rush
Thomas Wolf
30
580
0
07 Sep 2021
Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei
Maarten Bosma
Vincent Zhao
Kelvin Guu
Adams Wei Yu
Brian Lester
Nan Du
Andrew M. Dai
Quoc V. Le
ALM
UQCV
35
3,576
0
03 Sep 2021
AraT5: Text-to-Text Transformers for Arabic Language Generation
El Moatez Billah Nagoudi
AbdelRahim Elmadany
Muhammad Abdul-Mageed
89
118
0
31 Aug 2021
LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation
Jian Guan
Zhuoer Feng
Yamei Chen
Ru He
Xiaoxi Mao
Changjie Fan
Minlie Huang
39
32
0
30 Aug 2021
MTG: A Benchmark Suite for Multilingual Text Generation
Yiran Chen
Zhenqiao Song
Xianze Wu
Danqing Wang
Jingjing Xu
Jiaze Chen
Hao Zhou
Lei Li
LRM
VLM
32
22
0
13 Aug 2021
Semantic Answer Similarity for Evaluating Question Answering Models
Julian Risch
Timo Möller
Julian Gutsch
M. Pietsch
ELM
32
67
0
13 Aug 2021
How to Evaluate Your Dialogue Models: A Review of Approaches
Xinmeng Li
Wansen Wu
Long Qin
Quanjun Yin
ELM
30
8
0
03 Aug 2021
Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers
Mika Hämäläinen
Khalid Alnajjar
ELM
LM&MA
27
16
0
31 Jul 2021
The Benchmark Lottery
Mostafa Dehghani
Yi Tay
A. Gritsenko
Zhe Zhao
N. Houlsby
Fernando Diaz
Donald Metzler
Oriol Vinyals
42
89
0
14 Jul 2021
A Survey on Data Augmentation for Text Classification
Markus Bayer
M. Kaufhold
Christian A. Reuter
36
334
0
07 Jul 2021
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark
Tal August
Sofia Serrano
Nikita Haduong
Suchin Gururangan
Noah A. Smith
DeLMO
45
394
0
30 Jun 2021
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Simon Mille
Kaustubh D. Dhole
Saad Mahamood
Laura Perez-Beltrachini
Varun Gangal
Mihir Kale
Emiel van Miltenburg
Sebastian Gehrmann
ELM
42
22
0
16 Jun 2021
A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation
Sebastin Santy
Prasanta Bhattacharya
LLMAG
33
2
0
11 Jun 2021
Focus Attention: Promoting Faithfulness and Diversity in Summarization
Rahul Aralikatte
Shashi Narayan
Joshua Maynez
S. Rothe
Ryan T. McDonald
35
45
0
25 May 2021
Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods
Lifeng Han
Gareth J. F. Jones
Alan F. Smeaton
21
36
0
05 May 2021
Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark
Nouha Dziri
Hannah Rashkin
Tal Linzen
David Reitter
ALM
195
79
0
30 Apr 2021
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur
Nils Reimers
Andreas Rucklé
Abhishek Srivastava
Iryna Gurevych
VLM
237
971
0
17 Apr 2021
Crossing the Conversational Chasm: A Primer on Natural Language Processing for Multilingual Task-Oriented Dialogue Systems
E. Razumovskaia
Goran Glavaš
Olga Majewska
E. Ponti
Anna Korhonen
Ivan Vulić
23
32
0
17 Apr 2021
ExplainaBoard: An Explainable Leaderboard for NLP
Pengfei Liu
Jinlan Fu
Yanghua Xiao
Weizhe Yuan
Shuaichen Chang
Junqi Dai
Yixin Liu
Zihuiwen Ye
Zi-Yi Dou
Graham Neubig
XAI
LRM
ELM
28
54
0
13 Apr 2021
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
Gowtham Ramesh
Sumanth Doddapaneni
Aravinth Bheemaraj
Mayank Jobanputra
AK Raghavan
...
K. Deepak
Vivek Raghavan
Anoop Kunchukuttan
Pratyush Kumar
Mitesh Khapra
LRM
37
229
0
12 Apr 2021
The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
Anastasia Shimorina
Anya Belz
22
34
0
17 Mar 2021
DynaSent: A Dynamic Benchmark for Sentiment Analysis
Christopher Potts
Zhengxuan Wu
Atticus Geiger
Douwe Kiela
230
77
0
30 Dec 2020
GO FIGURE: A Meta Evaluation of Factuality in Summarization
Saadia Gabriel
Asli Celikyilmaz
Rahul Jha
Yejin Choi
Jianfeng Gao
HILM
238
96
0
24 Oct 2020
Evaluation of Text Generation: A Survey
Asli Celikyilmaz
Elizabeth Clark
Jianfeng Gao
ELM
LM&MA
19
376
0
26 Jun 2020
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Tal Linzen
220
189
0
03 May 2020
MLQA: Evaluating Cross-lingual Extractive Question Answering
Patrick Lewis
Barlas Oğuz
Ruty Rinott
Sebastian Riedel
Holger Schwenk
ELM
246
493
0
16 Oct 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,984
0
20 Apr 2018