
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
arXiv:2102.01672
2 February 2021
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
Antoine Bosselut
Khyathi Raghavi Chandu
Miruna Clinciu
Dipanjan Das
Kaustubh D. Dhole
Wanyu Du
Esin Durmus
Ondrej Dusek
Chris C. Emezue
Varun Gangal
Cristina Garbacea
Tatsunori Hashimoto
Yufang Hou
Yacine Jernite
Harsh Jhamtani
Yangfeng Ji
Shailza Jolly
Mihir Kale
Dhruv Kumar
Faisal Ladhak
Aman Madaan
Mounica Maddela
Khyati Mahajan
Saad Mahamood
Bodhisattwa Prasad Majumder
Pedro Henrique Martins
Angelina McMillan-Major
Simon Mille
Emiel van Miltenburg
Moin Nadeem
Shashi Narayan
Vitaly Nikolaev
Andre Niyongabo Rubungo
Salomey Osei
Ankur P. Parikh
Laura Perez-Beltrachini
Niranjan Rao
Vikas Raunak
Juan Diego Rodriguez
Sashank Santhanam
João Sedoc
Thibault Sellam
Samira Shaikh
Anastasia Shimorina
Marco Antonio Sobrevilla Cabezudo
Hendrik Strobelt
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
    VLM

Papers citing "The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics"

50 / 88 papers shown
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Naome A. Etori
Kevin Lu
Randu Karisa
Arturs Kanepajs
LRM
ELM
160
0
0
14 Mar 2025
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
76
1
0
26 Oct 2024
FLARE: Faithful Logic-Aided Reasoning and Exploration
Erik Arakelyan
Pasquale Minervini
Pat Verga
Patrick Lewis
Isabelle Augenstein
ReLM
LRM
66
2
0
14 Oct 2024
ExpertAF: Expert Actionable Feedback from Video
Kumar Ashutosh
Tushar Nagarajan
Georgios Pavlakos
Kris M. Kitani
Kristen Grauman
VGen
44
2
0
01 Aug 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM
ALM
LM&MA
97
31
0
09 Jun 2024
Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks
Huajian Zhang
Yumo Xu
Laura Perez-Beltrachini
HILM
32
9
0
27 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
57
29
0
02 Feb 2024
Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models
Levon Haroutunian
Zhuang Li
Lucian Galescu
Philip R. Cohen
Raj Tumuluri
Gholamreza Haffari
LRM
26
1
0
21 Sep 2023
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Nikita Martynov
Mark Baushenko
Anastasia Kozlova
Katerina Kolomeytseva
Aleksandr Abramov
Alena Fenogenova
35
2
0
18 Aug 2023
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
Ari Holtzman
Peter West
Luke Zettlemoyer
AI4CE
30
14
0
31 Jul 2023
Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features
Ester Hlavnova
Sebastian Ruder
30
5
0
11 Jul 2023
MuLER: Detailed and Scalable Reference-based Evaluation
Taelin Karidi
Leshem Choshen
Gal Patel
Omri Abend
30
0
0
24 May 2023
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert
Carlos Escolano
Marta R. Costa-jussá
CLL
MU
21
2
0
19 May 2023
mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
David C. Uthus
Santiago Ontañón
Joshua Ainslie
Mandy Guo
VLM
28
10
0
18 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stéphan Clémençon
Pierre Colombo
28
5
0
17 May 2023
A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training
Nitay Calderon
Subhabrata Mukherjee
Roi Reichart
Amir Kantor
31
17
0
03 May 2023
Lay Text Summarisation Using Natural Language Processing: A Narrative Literature Review
Oliver Vinzelberg
M. Jenkins
Gordon Morison
David McMinn
Z. Tieges
27
6
0
24 Mar 2023
NADBenchmarks -- a compilation of Benchmark Datasets for Machine Learning Tasks related to Natural Disasters
A. Proma
Md. Saiful Islam
Stela Ciko
Raiyan Abdul Baten
E. Hoque
34
3
0
21 Dec 2022
Evaluation for Change
Rishi Bommasani
ELM
40
0
0
20 Dec 2022
GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator
Jian Yang
Shuming Ma
Li Dong
Shaohan Huang
Haoyang Huang
Yuwei Yin
Dongdong Zhang
Liqun Yang
Furu Wei
Zhoujun Li
SyDa
AI4CE
32
25
0
20 Dec 2022
WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning
Wenhao Wu
Wei Li
Xinyan Xiao
Jiachen Liu
Sujian Li
Yajuan Lv
HILM
26
4
0
20 Dec 2022
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He
Jingyu Zhang
Tianle Wang
Sachin Kumar
Kyunghyun Cho
James R. Glass
Yulia Tsvetkov
40
44
0
20 Dec 2022
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MA
ALM
58
98
0
19 Dec 2022
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya
Holy Lovenia
Alham Fikri Aji
Genta Indra Winata
Bryan Wilie
...
Timothy Baldwin
Sebastian Ruder
Herry Sujaini
S. Sakti
Ayu Purwarianti
31
48
0
19 Dec 2022
CiteBench: A benchmark for Scientific Citation Text Generation
Martin Funkquist
Ilia Kuznetsov
Yufang Hou
Iryna Gurevych
25
16
0
19 Dec 2022
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu
Alexander R. Fabbri
Pengfei Liu
Yilun Zhao
Linyong Nan
...
Simeng Han
Shafiq R. Joty
Chien-Sheng Wu
Caiming Xiong
Dragomir R. Radev
ALM
15
132
0
15 Dec 2022
Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models
Hongyuan Lu
Haoyang Huang
Shuming Ma
Dongdong Zhang
W. Lam
Furu Wei
27
4
0
15 Dec 2022
Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich Platform for Graph Learning Benchmarks
Jiaqi Ma
Xingjian Zhang
Hezheng Fan
Jin Huang
Tianyue Li
Tinghong Li
Yiwen Tu
Chen Zhu
Qiaozhu Mei
40
5
0
08 Dec 2022
Cognitive Simplification Operations Improve Text Simplification
Eytan Chamovitz
Omri Abend
27
4
0
16 Nov 2022
Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding
Mirac Suzgun
Luke Melas-Kyriazi
Dan Jurafsky
22
43
0
14 Nov 2022
Time-aware Prompting for Text Generation
Shuyang Cao
Lu Wang
19
11
0
03 Nov 2022
LMentry: A Language Model Benchmark of Elementary Language Tasks
Avia Efrat
Or Honovich
Omer Levy
29
19
0
03 Nov 2022
FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness
Wenhao Wu
Wei Li
Jiachen Liu
Xinyan Xiao
Ziqiang Cao
Sujian Li
Hua-Hong Wu
HILM
21
10
0
01 Nov 2022
Universal Evasion Attacks on Summarization Scoring
Wenchuan Mu
Kwan Hui Lim
AAML
32
1
0
25 Oct 2022
SLING: Sino Linguistic Evaluation of Large Language Models
Yixiao Song
Kalpesh Krishna
R. Bhatt
Mohit Iyyer
21
8
0
21 Oct 2022
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Leandro von Werra
Lewis Tunstall
A. Thakur
A. Luccioni
Tristan Thrush
...
Julien Chaumond
Margaret Mitchell
Alexander M. Rush
Thomas Wolf
Douwe Kiela
ELM
23
24
0
30 Sep 2022
A Comprehensive Survey of Natural Language Generation Advances from the Perspective of Digital Deception
Keenan I. Jones
Enes ALTUNCU
V. N. Franqueira
Yi-Chia Wang
Shujun Li
DeLMO
34
3
0
11 Aug 2022
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Saleh Soltan
Shankar Ananthakrishnan
Jack G. M. FitzGerald
Rahul Gupta
Wael Hamza
...
Mukund Sridhar
Fabian Triefenbach
Apurv Verma
Gokhan Tur
Premkumar Natarajan
51
82
0
02 Aug 2022
RealTime QA: What's the Answer Right Now?
Jungo Kasai
Keisuke Sakaguchi
Yoichi Takahashi
Ronan Le Bras
Akari Asai
Xinyan Velocity Yu
Dragomir R. Radev
Noah A. Smith
Yejin Choi
Kentaro Inui
KELM
45
165
0
27 Jul 2022
Innovations in Neural Data-to-text Generation: A Survey
Mandar Sharma
Ajay K. Gogineni
Naren Ramakrishnan
29
10
0
25 Jul 2022
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
Mirac Suzgun
Luke Melas-Kyriazi
Suproteem K. Sarkar
S. Kominers
Stuart M. Shieber
46
25
0
08 Jul 2022
On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation
Tiago Pimentel
Clara Meister
Ryan Cotterell
40
7
0
31 May 2022
The Authenticity Gap in Human Evaluation
Kawin Ethayarajh
Dan Jurafsky
81
24
0
24 May 2022
"I'm sorry to hear that": Finding New Biases in Language Models with a
  Holistic Descriptor Dataset
"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset
Eric Michael Smith
Melissa Hall
Melanie Kambadur
Eleonora Presani
Adina Williams
76
129
0
18 May 2022
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Philippe Laban
Chien-Sheng Wu
Wenhao Liu
Caiming Xiong
35
5
0
13 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
99
26
0
13 May 2022
UL2: Unifying Language Learning Paradigms
Yi Tay
Mostafa Dehghani
Vinh Q. Tran
Xavier Garcia
Jason W. Wei
...
Tal Schuster
H. Zheng
Denny Zhou
N. Houlsby
Donald Metzler
AI4CE
57
296
0
10 May 2022
Vector Representations of Idioms in Conversational Systems
Tosin P. Adewumi
F. Liwicki
Marcus Liwicki
30
8
0
07 May 2022
When a sentence does not introduce a discourse entity, Transformer-based models still sometimes refer to it
Sebastian Schuster
Tal Linzen
13
25
0
06 May 2022
State-of-the-art in Open-domain Conversational AI: A Survey
Tosin P. Adewumi
F. Liwicki
Marcus Liwicki
26
15
0
02 May 2022