ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.01672
  4. Cited By
The GEM Benchmark: Natural Language Generation, its Evaluation and
  Metrics

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

2 February 2021
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
Antoine Bosselut
Khyathi Raghavi Chandu
Miruna Clinciu
Dipanjan Das
Kaustubh D. Dhole
Wanyu Du
Esin Durmus
Ondrej Dusek
Chris C. Emezue
Varun Gangal
Cristina Garbacea
Tatsunori Hashimoto
Yufang Hou
Yacine Jernite
Harsh Jhamtani
Yangfeng Ji
Shailza Jolly
Mihir Kale
Dhruv Kumar
Faisal Ladhak
Aman Madaan
Mounica Maddela
Khyati Mahajan
Saad Mahamood
Bodhisattwa Prasad Majumder
Pedro Henrique Martins
Angelina McMillan-Major
Simon Mille
Emiel van Miltenburg
Moin Nadeem
Shashi Narayan
Vitaly Nikolaev
Andre Niyongabo Rubungo
Salomey Osei
Ankur P. Parikh
Laura Perez-Beltrachini
Niranjan Rao
Vikas Raunak
Juan Diego Rodriguez
Sashank Santhanam
João Sedoc
Thibault Sellam
Samira Shaikh
Anastasia Shimorina
Marco Antonio Sobrevilla Cabezudo
Hendrik Strobelt
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
    VLM
ArXivPDFHTML

Papers citing "The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics"

50 / 95 papers shown
Title
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Naome A. Etori
Kevin Lu
Randu Karisa
Arturs Kanepajs
LRM
ELM
169
0
0
14 Mar 2025
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
85
1
0
26 Oct 2024
FLARE: Faithful Logic-Aided Reasoning and Exploration
FLARE: Faithful Logic-Aided Reasoning and Exploration
Erik Arakelyan
Pasquale Minervini
Pat Verga
Patrick Lewis
Isabelle Augenstein
ReLM
LRM
69
2
0
14 Oct 2024
ExpertAF: Expert Actionable Feedback from Video
ExpertAF: Expert Actionable Feedback from Video
Kumar Ashutosh
Tushar Nagarajan
Georgios Pavlakos
Kris M. Kitani
Kristen Grauman
VGen
44
2
0
01 Aug 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM
ALM
LM&MA
105
31
0
09 Jun 2024
Fine-Grained Natural Language Inference Based Faithfulness Evaluation
  for Diverse Summarisation Tasks
Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks
Huajian Zhang
Yumo Xu
Laura Perez-Beltrachini
HILM
32
9
0
27 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
65
29
0
02 Feb 2024
Reranking for Natural Language Generation from Logical Forms: A Study
  based on Large Language Models
Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models
Levon Haroutunian
Zhuang Li
Lucian Galescu
Philip R. Cohen
Raj Tumuluri
Gholamreza Haffari
LRM
29
1
0
21 Sep 2023
A Methodology for Generative Spelling Correction via Natural Spelling
  Errors Emulation across Multiple Domains and Languages
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Nikita Martynov
Mark Baushenko
Anastasia Kozlova
Katerina Kolomeytseva
Aleksandr Abramov
Alena Fenogenova
38
2
0
18 Aug 2023
Generative Models as a Complex Systems Science: How can we make sense of
  large language model behavior?
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
Ari Holtzman
Peter West
Luke Zettlemoyer
AI4CE
32
14
0
31 Jul 2023
Empowering Cross-lingual Behavioral Testing of NLP Models with
  Typological Features
Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features
Ester Hlavnova
Sebastian Ruder
35
5
0
11 Jul 2023
Measuring the Robustness of NLP Models to Domain Shifts
Measuring the Robustness of NLP Models to Domain Shifts
Nitay Calderon
Naveh Porat
Eyal Ben-David
Alexander Chapanin
Zorik Gekhman
Nadav Oved
Vitaly Shalumov
Roi Reichart
21
7
0
31 May 2023
MuLER: Detailed and Scalable Reference-based Evaluation
MuLER: Detailed and Scalable Reference-based Evaluation
Taelin Karidi
Leshem Choshen
Gal Patel
Omri Abend
40
0
0
24 May 2023
ReSeTOX: Re-learning attention weights for toxicity mitigation in
  machine translation
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert
Carlos Escolano
Marta R. Costa-jussá
CLL
MU
26
2
0
19 May 2023
mLongT5: A Multilingual and Efficient Text-To-Text Transformer for
  Longer Sequences
mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
David C. Uthus
Santiago Ontañón
Joshua Ainslie
Mandy Guo
VLM
28
10
0
18 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in
  Benchmarks
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stéphan Clémençon
Pierre Colombo
31
5
0
17 May 2023
A Systematic Study of Knowledge Distillation for Natural Language
  Generation with Pseudo-Target Training
A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training
Nitay Calderon
Subhabrata Mukherjee
Roi Reichart
Amir Kantor
38
17
0
03 May 2023
Lay Text Summarisation Using Natural Language Processing: A Narrative
  Literature Review
Lay Text Summarisation Using Natural Language Processing: A Narrative Literature Review
Oliver Vinzelberg
M. Jenkins
Gordon Morison
David McMinn
Z. Tieges
32
6
0
24 Mar 2023
NADBenchmarks -- a compilation of Benchmark Datasets for Machine
  Learning Tasks related to Natural Disasters
NADBenchmarks -- a compilation of Benchmark Datasets for Machine Learning Tasks related to Natural Disasters
A. Proma
Md. Saiful Islam
Stela Ciko
Raiyan Abdul Baten
E. Hoque
42
3
0
21 Dec 2022
Evaluation for Change
Evaluation for Change
Rishi Bommasani
ELM
40
0
0
20 Dec 2022
GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator
GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator
Jian Yang
Shuming Ma
Li Dong
Shaohan Huang
Haoyang Huang
Yuwei Yin
Dongdong Zhang
Liqun Yang
Furu Wei
Zhoujun Li
SyDa
AI4CE
32
25
0
20 Dec 2022
WeCheck: Strong Factual Consistency Checker via Weakly Supervised
  Learning
WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning
Wenhao Wu
Wei Li
Xinyan Xiao
Jiachen Liu
Sujian Li
Yajuan Lv
HILM
26
4
0
20 Dec 2022
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He
Jingyu Zhang
Tianle Wang
Sachin Kumar
Kyunghyun Cho
James R. Glass
Yulia Tsvetkov
40
44
0
20 Dec 2022
Evaluating Human-Language Model Interaction
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MA
ALM
58
99
0
19 Dec 2022
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya
Holy Lovenia
Alham Fikri Aji
Genta Indra Winata
Bryan Wilie
...
Timothy Baldwin
Sebastian Ruder
Herry Sujaini
S. Sakti
Ayu Purwarianti
39
48
0
19 Dec 2022
CiteBench: A benchmark for Scientific Citation Text Generation
CiteBench: A benchmark for Scientific Citation Text Generation
Martin Funkquist
Ilia Kuznetsov
Yufang Hou
Iryna Gurevych
31
16
0
19 Dec 2022
Revisiting the Gold Standard: Grounding Summarization Evaluation with
  Robust Human Evaluation
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu
Alexander R. Fabbri
Pengfei Liu
Yilun Zhao
Linyong Nan
...
Simeng Han
Chenyu You
Chien-Sheng Wu
Caiming Xiong
Dragomir R. Radev
ALM
24
132
0
15 Dec 2022
Advancing Multilingual Pre-training: TRIP Triangular Document-level
  Pre-training for Multilingual Language Models
Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models
Hongyuan Lu
Haoyang Huang
Shuming Ma
Dongdong Zhang
W. Lam
Furu Wei
27
4
0
15 Dec 2022
Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich
  Platform for Graph Learning Benchmarks
Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich Platform for Graph Learning Benchmarks
Jiaqi Ma
Xingjian Zhang
Hezheng Fan
Jin Huang
Tianyue Li
Tinghong Li
Yiwen Tu
Chen Zhu
Qiaozhu Mei
40
5
0
08 Dec 2022
A Major Obstacle for NLP Research: Let's Talk about Time Allocation!
A Major Obstacle for NLP Research: Let's Talk about Time Allocation!
Katharina Kann
Shiran Dudy
Arya D. McCarthy
25
1
0
30 Nov 2022
Cognitive Simplification Operations Improve Text Simplification
Cognitive Simplification Operations Improve Text Simplification
Eytan Chamovitz
Omri Abend
35
4
0
16 Nov 2022
Follow the Wisdom of the Crowd: Effective Text Generation via Minimum
  Bayes Risk Decoding
Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding
Mirac Suzgun
Luke Melas-Kyriazi
Dan Jurafsky
30
43
0
14 Nov 2022
Time-aware Prompting for Text Generation
Time-aware Prompting for Text Generation
Shuyang Cao
Lu Wang
29
11
0
03 Nov 2022
LMentry: A Language Model Benchmark of Elementary Language Tasks
LMentry: A Language Model Benchmark of Elementary Language Tasks
Avia Efrat
Or Honovich
Omer Levy
29
19
0
03 Nov 2022
FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual
  Robustness
FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness
Wenhao Wu
Wei Li
Jiachen Liu
Xinyan Xiao
Ziqiang Cao
Sujian Li
Hua Wu
HILM
32
10
0
01 Nov 2022
Questioning the Validity of Summarization Datasets and Improving Their
  Factual Consistency
Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency
Yanzhu Guo
Chloé Clavel
Moussa Kamal Eddine
Michalis Vazirgiannis
HILM
32
11
0
31 Oct 2022
Universal Evasion Attacks on Summarization Scoring
Universal Evasion Attacks on Summarization Scoring
Wenchuan Mu
Kwan Hui Lim
AAML
38
1
0
25 Oct 2022
SLING: Sino Linguistic Evaluation of Large Language Models
SLING: Sino Linguistic Evaluation of Large Language Models
Yixiao Song
Kalpesh Krishna
R. Bhatt
Mohit Iyyer
24
8
0
21 Oct 2022
GLM-130B: An Open Bilingual Pre-trained Model
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng-Zhen Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
253
1,073
0
05 Oct 2022
Evaluate & Evaluation on the Hub: Better Best Practices for Data and
  Model Measurements
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Leandro von Werra
Lewis Tunstall
A. Thakur
A. Luccioni
Tristan Thrush
...
Julien Chaumond
Margaret Mitchell
Alexander M. Rush
Thomas Wolf
Douwe Kiela
ELM
23
24
0
30 Sep 2022
A Comprehensive Survey of Natural Language Generation Advances from the
  Perspective of Digital Deception
A Comprehensive Survey of Natural Language Generation Advances from the Perspective of Digital Deception
Keenan I. Jones
Enes ALTUNCU
V. N. Franqueira
Yi-Chia Wang
Shujun Li
DeLMO
36
3
0
11 Aug 2022
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq
  Model
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Saleh Soltan
Shankar Ananthakrishnan
Jack G. M. FitzGerald
Rahul Gupta
Wael Hamza
...
Mukund Sridhar
Fabian Triefenbach
Apurv Verma
Gokhan Tur
Premkumar Natarajan
54
82
0
02 Aug 2022
RealTime QA: What's the Answer Right Now?
RealTime QA: What's the Answer Right Now?
Jungo Kasai
Keisuke Sakaguchi
Yoichi Takahashi
Ronan Le Bras
Akari Asai
Xinyan Velocity Yu
Dragomir R. Radev
Noah A. Smith
Yejin Choi
Kentaro Inui
KELM
45
165
0
27 Jul 2022
Innovations in Neural Data-to-text Generation: A Survey
Innovations in Neural Data-to-text Generation: A Survey
Mandar Sharma
Ajay K. Gogineni
Naren Ramakrishnan
32
10
0
25 Jul 2022
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and
  Multi-Purpose Corpus of Patent Applications
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
Mirac Suzgun
Luke Melas-Kyriazi
Suproteem K. Sarkar
S. Kominers
Stuart M. Shieber
46
26
0
08 Jul 2022
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann
Abhik Bhattacharjee
Abinaya Mahendiran
Alex Jinpeng Wang
Alexandros Papangelis
...
Yacine Jernite
Yi Xu
Yisi Sang
Yixin Liu
Yufang Hou
47
38
0
22 Jun 2022
On the Usefulness of Embeddings, Clusters and Strings for Text Generator
  Evaluation
On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation
Tiago Pimentel
Clara Meister
Ryan Cotterell
48
7
0
31 May 2022
The Authenticity Gap in Human Evaluation
The Authenticity Gap in Human Evaluation
Kawin Ethayarajh
Dan Jurafsky
87
24
0
24 May 2022
"I'm sorry to hear that": Finding New Biases in Language Models with a
  Holistic Descriptor Dataset
"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset
Eric Michael Smith
Melissa Hall
Melanie Kambadur
Eleonora Presani
Adina Williams
79
130
0
18 May 2022
Near-Negative Distinction: Giving a Second Life to Human Evaluation
  Datasets
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Philippe Laban
Chien-Sheng Wu
Wenhao Liu
Caiming Xiong
41
5
0
13 May 2022
12
Next