ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.09766
  4. Cited By
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
v1v2v3v4 (latest)

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

16 November 2023
Yiqi Liu
N. Moosavi
Chenghua Lin
    ELM
ArXiv (abs)PDFHTML

Papers citing "LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores"

27 / 27 papers shown
Title
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
116
0
0
16 Apr 2025
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie
Gray Gwizdz
Dongji Feng
124
0
0
20 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
142
29
0
03 Feb 2025
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang
Wei Zhao
Steffen Eger
123
10
0
24 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELMALM
109
10
0
17 Oct 2024
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan
D. Zhu
Matthias Aßenmacher
Xiaoyu Shen
Benjamin Roth
ELM
87
5
0
06 Sep 2024
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation
  Strategy of Consistency Model
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang
Zhihui Wang
Siqi Yin
Guangxiao Ma
Peng Zhang
Boxi Wu
DiffM
135
0
0
28 Aug 2024
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He
Jingyu Zhang
Tianle Wang
Sachin Kumar
Kyunghyun Cho
James R. Glass
Yulia Tsvetkov
106
45
0
20 Dec 2022
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Yiwei Qin
Weizhe Yuan
Graham Neubig
Pengfei Liu
61
23
0
12 Dec 2022
How Far are We from Robust Long Abstractive Summarization?
How Far are We from Robust Long Abstractive Summarization?
Huan Yee Koh
Jiaxin Ju
He Zhang
Ming Liu
Shirui Pan
HILM
96
40
0
30 Oct 2022
On the Limitations of Reference-Free Evaluations of Generated Text
On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch
Rotem Dror
Dan Roth
111
47
0
22 Oct 2022
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Ming Zhong
Yang Liu
Da Yin
Yuning Mao
Yizhu Jiao
Peng Liu
Chenguang Zhu
Heng Ji
Jiawei Han
ELM
85
276
0
13 Oct 2022
Shortcomings of Question Answering Based Factuality Frameworks for Error
  Localization
Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
Ryo Kamoi
Tanya Goyal
Greg Durrett
HILM
62
14
0
13 Oct 2022
SMART: Sentences as Basic Units for Text Evaluation
SMART: Sentences as Basic Units for Text Evaluation
Reinald Kim Amplayo
Peter J. Liu
Yao-Min Zhao
Shashi Narayan
62
22
0
01 Aug 2022
Spurious Correlations in Reference-Free Evaluation of Text Generation
Spurious Correlations in Reference-Free Evaluation of Text Generation
Esin Durmus
Faisal Ladhak
Tatsunori Hashimoto
62
31
0
21 Apr 2022
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
146
58
0
13 Sep 2021
BARTScore: Evaluating Generated Text as Text Generation
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan
Graham Neubig
Pengfei Liu
123
849
0
22 Jun 2021
$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues
  via Question Generation and Question Answering
Q2Q^{2}Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Or Honovich
Leshem Choshen
Roee Aharoni
Ella Neeman
Idan Szpektor
Omri Abend
HILM
78
141
0
16 Apr 2021
QuestEval: Summarization Asks for Fact-based Evaluation
QuestEval: Summarization Asks for Fact-based Evaluation
Thomas Scialom
Paul-Alexis Dray
Patrick Gallinari
Sylvain Lamprier
Benjamin Piwowarski
Jacopo Staiano
Alex Jinpeng Wang
HILM
64
276
0
23 Mar 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and
  Metrics
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
311
285
0
02 Feb 2021
Re-evaluating Evaluation in Text Summarization
Re-evaluating Evaluation in Text Summarization
Manik Bhandari
Pranav Narayan Gour
A. Ashfaq
Pengfei Liu
Graham Neubig
150
178
0
14 Oct 2020
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for
  Multi-Document Summarization
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
Yang Gao
Wei Zhao
Steffen Eger
ELM
92
126
0
07 May 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
485
20,317
0
23 Oct 2019
MoverScore: Text Generation Evaluating with Contextualized Embeddings
  and Earth Mover Distance
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao
Maxime Peyrard
Fei Liu
Yang Gao
Christian M. Meyer
Steffen Eger
187
602
0
05 Sep 2019
BERTScore: Evaluating Text Generation with BERT
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
352
5,868
0
21 Apr 2019
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional
  Neural Networks for Extreme Summarization
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
Shashi Narayan
Shay B. Cohen
Mirella Lapata
AILaw
146
1,683
0
27 Aug 2018
Teaching Machines to Read and Comprehend
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomás Kociský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
351
3,553
0
10 Jun 2015
1