ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2202.06935
  4. Cited By
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation
  Practices for Generated Text

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

14 February 2022
Sebastian Gehrmann
Elizabeth Clark
Thibault Sellam
    ELM
    AI4CE
ArXivPDFHTML

Papers citing "Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text"

50 / 67 papers shown
Title
SPHERE: An Evaluation Card for Human-AI Systems
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma
Dora Zhao
Xinran Zhao
Chenglei Si
Chenyang Yang
Ryan Louie
Ehud Reiter
Diyi Yang
Tongshuang Wu
ALM
50
0
0
24 Mar 2025
A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks
A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks
Elie Antoine
Frédéric Béchet
Géraldine Damnati
Philippe Langlais
56
1
0
29 Jan 2025
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim
Kyungjae Lee
Y. Jang
Ji Yong Cho
Gangwoo Kim
Minseok Cho
Moontae Lee
151
0
0
28 Jan 2025
Leveraging Entailment Judgements in Cross-Lingual Summarisation
Leveraging Entailment Judgements in Cross-Lingual Summarisation
Huajian Zhang
Laura Perez-Beltrachini
HILM
38
0
0
01 Aug 2024
Beyond Metrics: A Critical Analysis of the Variability in Large Language
  Model Evaluation Frameworks
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks
Marco AF Pimentel
Clément Christophe
Tathagata Raha
Prateek Munjal
Praveen K Kanithi
Shadab Khan
ELM
34
2
0
29 Jul 2024
MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
Fang Guo
Wenyu Li
Honglei Zhuang
Yun Luo
Yafu Li
Qi Zhu
Le Yan
Yue Zhang
ALM
70
6
0
18 Apr 2024
How Much Annotation is Needed to Compare Summarization Models?
How Much Annotation is Needed to Compare Summarization Models?
Chantal Shaib
Joe Barrow
Alexa F. Siu
Byron C. Wallace
A. Nenkova
53
2
0
28 Feb 2024
Fine-Grained Natural Language Inference Based Faithfulness Evaluation
  for Diverse Summarisation Tasks
Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks
Huajian Zhang
Yumo Xu
Laura Perez-Beltrachini
HILM
32
9
0
27 Feb 2024
SaGE: Evaluating Moral Consistency in Large Language Models
SaGE: Evaluating Moral Consistency in Large Language Models
Vamshi Krishna Bonagiri
Sreeram Vennam
Priyanshul Govil
Ponnurangam Kumaraguru
Manas Gaur
ELM
56
0
0
21 Feb 2024
Event-Keyed Summarization
Event-Keyed Summarization
William Gantt
Alexander Martin
Pavlo Kuchmiichuk
Aaron Steven White
22
1
0
10 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
60
29
0
02 Feb 2024
Mitigating Open-Vocabulary Caption Hallucinations
Mitigating Open-Vocabulary Caption Hallucinations
Assaf Ben-Kish
Moran Yanuka
Morris Alper
Raja Giryes
Hadar Averbuch-Elor
MLLM
VLM
23
6
0
06 Dec 2023
X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented
  Instruction Tuning with Auxiliary Evaluation Aspects
X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects
Minqian Liu
Ying Shen
Zhiyang Xu
Yixin Cao
Eunah Cho
Vaibhav Kumar
Reza Ghanadan
Lifu Huang
ELM
LM&MA
ALM
49
25
0
15 Nov 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
96
11,007
0
18 Jul 2023
Mini-Giants: "Small" Language Models and Open Source Win-Win
Mini-Giants: "Small" Language Models and Open Source Win-Win
Zhengping Zhou
Lezhi Li
Xinxi Chen
Andy Li
SyDa
ALM
MoE
26
6
0
17 Jul 2023
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed
  Question Answering
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Pei Ke
Fei Huang
Fei Mi
Yasheng Wang
Qun Liu
Xiaoyan Zhu
Minlie Huang
ReLM
ELM
36
10
0
13 Jul 2023
Natural Language Generation for Advertising: A Survey
Natural Language Generation for Advertising: A Survey
Soichiro Murakami
Sho Hoshino
Peinan Zhang
14
10
0
22 Jun 2023
AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap
AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap
Q. V. Liao
J. Vaughan
38
158
0
02 Jun 2023
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
Q. V. Liao
Ziang Xiao
ALM
ELM
47
29
0
01 Jun 2023
A Critical Evaluation of Evaluations for Long-form Question Answering
A Critical Evaluation of Evaluations for Long-form Question Answering
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
ELM
37
96
0
29 May 2023
Generating EDU Extracts for Plan-Guided Summary Re-Ranking
Generating EDU Extracts for Plan-Guided Summary Re-Ranking
Griffin Adams
Alexander R. Fabbri
Faisal Ladhak
Kathleen McKeown
Noémie Elhadad
18
10
0
28 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in
  Benchmarks
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stéphan Clémençon
Pierre Colombo
28
5
0
17 May 2023
What are the Desired Characteristics of Calibration Sets? Identifying
  Correlates on Long Form Scientific Summarization
What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
Griffin Adams
Bichlien H. Nguyen
Jake A. Smith
Yingce Xia
Shufang Xie
Anna Ostropolets
Budhaditya Deb
Yuan Chen
Tristan Naumann
Noémie Elhadad
30
8
0
12 May 2023
Evaluating the Efficacy of Length-Controllable Machine Translation
Evaluating the Efficacy of Length-Controllable Machine Translation
Hao Cheng
Meng Zhang
Weixuan Wang
Liangyou Li
Qun Liu
Zhihua Zhang
32
0
0
03 May 2023
Large language models effectively leverage document-level context for
  literary translation, but critical errors persist
Large language models effectively leverage document-level context for literary translation, but critical errors persist
Marzena Karpinska
Mohit Iyyer
31
82
0
06 Apr 2023
BloombergGPT: A Large Language Model for Finance
BloombergGPT: A Large Language Model for Finance
Shijie Wu
Ozan Irsoy
Steven Lu
Vadim Dabravolski
Mark Dredze
Sebastian Gehrmann
P. Kambadur
David S. Rosenberg
Gideon Mann
AIFin
76
786
0
30 Mar 2023
Evaluating NLG systems: A brief introduction
Evaluating NLG systems: A brief introduction
Emiel van Miltenburg
31
0
0
29 Mar 2023
Revisiting the Gold Standard: Grounding Summarization Evaluation with
  Robust Human Evaluation
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu
Alexander R. Fabbri
Pengfei Liu
Yilun Zhao
Linyong Nan
...
Simeng Han
Shafiq R. Joty
Chien-Sheng Wu
Caiming Xiong
Dragomir R. Radev
ALM
24
132
0
15 Dec 2022
Follow the Wisdom of the Crowd: Effective Text Generation via Minimum
  Bayes Risk Decoding
Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding
Mirac Suzgun
Luke Melas-Kyriazi
Dan Jurafsky
24
43
0
14 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
116
2,310
0
09 Nov 2022
Dialect-robust Evaluation of Generated Text
Dialect-robust Evaluation of Generated Text
Jiao Sun
Thibault Sellam
Elizabeth Clark
Tu Vu
Timothy Dozat
Dan Garrette
Aditya Siddhant
Jacob Eisenstein
Sebastian Gehrmann
20
19
0
02 Nov 2022
TaTa: A Multilingual Table-to-Text Dataset for African Languages
TaTa: A Multilingual Table-to-Text Dataset for African Languages
Sebastian Gehrmann
Sebastian Ruder
Vitaly Nikolaev
Jan A. Botha
Michael Chavinda
Ankur P. Parikh
Clara E. Rivera
LMTD
21
10
0
31 Oct 2022
Questioning the Validity of Summarization Datasets and Improving Their
  Factual Consistency
Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency
Yanzhu Guo
Chloé Clavel
Moussa Kamal Eddine
Michalis Vazirgiannis
HILM
27
11
0
31 Oct 2022
Towards Interpretable Summary Evaluation via Allocation of Contextual
  Embeddings to Reference Text Topics
Towards Interpretable Summary Evaluation via Allocation of Contextual Embeddings to Reference Text Topics
Ben Schaper
Christopher Lohse
Marcell Streile
Andrea Giovannini
Richard Osuala
16
1
0
25 Oct 2022
DEMETR: Diagnosing Evaluation Metrics for Translation
DEMETR: Diagnosing Evaluation Metrics for Translation
Marzena Karpinska
N. Raj
Katherine Thai
Yixiao Song
Ankita Gupta
Mohit Iyyer
23
37
0
25 Oct 2022
SafeText: A Benchmark for Exploring Physical Safety in Language Models
SafeText: A Benchmark for Exploring Physical Safety in Language Models
Sharon Levy
Emily Allaway
Melanie Subbiah
Lydia B. Chilton
D. Patton
Kathleen McKeown
William Yang Wang
59
40
0
18 Oct 2022
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Ming Zhong
Yang Liu
Da Yin
Yuning Mao
Yizhu Jiao
Peng Liu
Chenguang Zhu
Heng Ji
Jiawei Han
ELM
45
254
0
13 Oct 2022
REV: Information-Theoretic Evaluation of Free-Text Rationales
REV: Information-Theoretic Evaluation of Free-Text Rationales
Hanjie Chen
Faeze Brahman
Xiang Ren
Yangfeng Ji
Yejin Choi
Swabha Swayamdipta
89
23
0
10 Oct 2022
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
Alex Jinpeng Wang
Richard Yuanzhe Pang
Angelica Chen
Jason Phang
Samuel R. Bowman
74
44
0
23 May 2022
RankGen: Improving Text Generation with Large Ranking Models
RankGen: Improving Text Generation with Large Ranking Models
Kalpesh Krishna
Yapei Chang
John Wieting
Mohit Iyyer
AIMat
24
68
0
19 May 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and
  Their Implications
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications
Kaitlyn Zhou
Su Lin Blodgett
Adam Trischler
Hal Daumé
Kaheer Suleman
Alexandra Olteanu
ELM
99
26
0
13 May 2022
UL2: Unifying Language Learning Paradigms
UL2: Unifying Language Learning Paradigms
Yi Tay
Mostafa Dehghani
Vinh Q. Tran
Xavier Garcia
Jason W. Wei
...
Tal Schuster
H. Zheng
Denny Zhou
N. Houlsby
Donald Metzler
AI4CE
57
296
0
10 May 2022
deep-significance - Easy and Meaningful Statistical Significance Testing
  in the Age of Neural Networks
deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks
Dennis Ulmer
Christian Hardmeier
J. Frellsen
48
42
0
14 Apr 2022
Towards Explainable Evaluation Metrics for Natural Language Generation
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter
Piyawat Lertvittayakumjorn
M. Fomicheva
Wei-Ye Zhao
Yang Gao
Steffen Eger
AAML
ELM
22
20
0
21 Mar 2022
Adaptive Sampling Strategies to Construct Equitable Training Datasets
Adaptive Sampling Strategies to Construct Equitable Training Datasets
William Cai
R. Encarnación
Bobbie Chern
S. Corbett-Davies
Miranda Bogen
Stevie Bergman
Sharad Goel
81
30
0
31 Jan 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
367
8,495
0
28 Jan 2022
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai
Keisuke Sakaguchi
Ronan Le Bras
Lavinia Dunagan
Jacob Morrison
Alexander R. Fabbri
Yejin Choi
Noah A. Smith
51
39
0
08 Dec 2021
NL-Augmenter: A Framework for Task-Sensitive Natural Language
  Augmentation
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh D. Dhole
Varun Gangal
Sebastian Gehrmann
Aadesh Gupta
Zhenhao Li
...
Tianbao Xie
Usama Yaseen
Michael A. Yee
Jing Zhang
Yue Zhang
174
86
0
06 Dec 2021
A Survey of NLP-Related Crowdsourcing HITs: what works and what does not
A Survey of NLP-Related Crowdsourcing HITs: what works and what does not
Jessica Huynh
Jeffrey P. Bigham
M. Eskénazi
46
18
0
09 Nov 2021
A Framework for Deprecating Datasets: Standardizing Documentation,
  Identification, and Communication
A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication
A. Luccioni
Frances Corry
H. Sridharan
Mike Ananny
J. Schultz
Kate Crawford
54
29
0
18 Oct 2021
12
Next