Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2311.09766
Cited By
v1
v2
v3
v4 (latest)
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
16 November 2023
Yiqi Liu
N. Moosavi
Chenghua Lin
ELM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores"
27 / 27 papers shown
Title
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
116
0
0
16 Apr 2025
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie
Gray Gwizdz
Dongji Feng
124
0
0
20 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
142
29
0
03 Feb 2025
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang
Wei Zhao
Steffen Eger
123
10
0
24 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM
ALM
109
10
0
17 Oct 2024
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan
D. Zhu
Matthias Aßenmacher
Xiaoyu Shen
Benjamin Roth
ELM
87
5
0
06 Sep 2024
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang
Zhihui Wang
Siqi Yin
Guangxiao Ma
Peng Zhang
Boxi Wu
DiffM
135
0
0
28 Aug 2024
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He
Jingyu Zhang
Tianle Wang
Sachin Kumar
Kyunghyun Cho
James R. Glass
Yulia Tsvetkov
106
45
0
20 Dec 2022
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Yiwei Qin
Weizhe Yuan
Graham Neubig
Pengfei Liu
61
23
0
12 Dec 2022
How Far are We from Robust Long Abstractive Summarization?
Huan Yee Koh
Jiaxin Ju
He Zhang
Ming Liu
Shirui Pan
HILM
96
40
0
30 Oct 2022
On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch
Rotem Dror
Dan Roth
111
47
0
22 Oct 2022
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Ming Zhong
Yang Liu
Da Yin
Yuning Mao
Yizhu Jiao
Peng Liu
Chenguang Zhu
Heng Ji
Jiawei Han
ELM
85
276
0
13 Oct 2022
Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
Ryo Kamoi
Tanya Goyal
Greg Durrett
HILM
62
14
0
13 Oct 2022
SMART: Sentences as Basic Units for Text Evaluation
Reinald Kim Amplayo
Peter J. Liu
Yao-Min Zhao
Shashi Narayan
62
22
0
01 Aug 2022
Spurious Correlations in Reference-Free Evaluation of Text Generation
Esin Durmus
Faisal Ladhak
Tatsunori Hashimoto
62
31
0
21 Apr 2022
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
146
58
0
13 Sep 2021
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan
Graham Neubig
Pengfei Liu
123
849
0
22 Jun 2021
Q
2
Q^{2}
Q
2
: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Or Honovich
Leshem Choshen
Roee Aharoni
Ella Neeman
Idan Szpektor
Omri Abend
HILM
78
141
0
16 Apr 2021
QuestEval: Summarization Asks for Fact-based Evaluation
Thomas Scialom
Paul-Alexis Dray
Patrick Gallinari
Sylvain Lamprier
Benjamin Piwowarski
Jacopo Staiano
Alex Jinpeng Wang
HILM
64
276
0
23 Mar 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
311
285
0
02 Feb 2021
Re-evaluating Evaluation in Text Summarization
Manik Bhandari
Pranav Narayan Gour
A. Ashfaq
Pengfei Liu
Graham Neubig
150
178
0
14 Oct 2020
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
Yang Gao
Wei Zhao
Steffen Eger
ELM
92
126
0
07 May 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
485
20,317
0
23 Oct 2019
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao
Maxime Peyrard
Fei Liu
Yang Gao
Christian M. Meyer
Steffen Eger
187
602
0
05 Sep 2019
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
352
5,868
0
21 Apr 2019
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
Shashi Narayan
Shay B. Cohen
Mirella Lapata
AILaw
146
1,683
0
27 Aug 2018
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomás Kociský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
351
3,553
0
10 Jun 2015
1