Is ChatGPT a Good NLG Evaluator? A Preliminary Study
v1, v2, v3 (latest)


7 March 2023
Jiaan Wang
Yunlong Liang
Fandong Meng
Zengkui Sun
Haoxiang Shi
Zhixu Li
Jinan Xu
Jianfeng Qu
Jie Zhou
    LM&MA, ELM, ALM, AI4MH

Papers citing "Is ChatGPT a Good NLG Evaluator? A Preliminary Study"

50 / 307 papers shown
Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
Borui Xu
Yao Chen
Zeyi Wen
Weiguo Liu
Bingsheng He
188
2
0
02 Feb 2025
Learning to Summarize from LLM-generated Feedback
Hwanjun Song
Taewon Yun
Yuho Lee
Jihwan Oh
Gihun Lee
Jason (Jinglun) Cai
Hang Su
225
10
0
28 Jan 2025
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Yinhong Liu
Han Zhou
Zhijiang Guo
Ehsan Shareghi
Ivan Vulić
Anna Korhonen
Nigel Collier
ALM
207
83
0
20 Jan 2025
Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
Shunfan Zheng
Xiechi Zhang
Gerard de Melo
Xiaoling Wang
Linlin Wang
LM&MA, ELM
49
1
0
12 Jan 2025
Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
Justin Vasselli
Adam Nohejl
Taro Watanabe
AAML
80
0
0
12 Jan 2025
Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Dong Yuan
Eti Rastogi
Fen Zhao
Sagar Goyal
Gautam Naik
Sree Prasanna Rajagopal
63
0
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw, LM&MA, LRM
145
30
0
31 Dec 2024
Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models
Chengyan Wu
Bolei Ma
Zheyu Zhang
Ningyuan Deng
Yanqing He
Yun Xue
LRM
122
1
0
17 Dec 2024
ACE-$M^3$: Automatic Capability Evaluator for Multimodal Medical Models
Xiechi Zhang
Shunfan Zheng
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Liang He
ELM
149
0
0
16 Dec 2024
Explingo: Explaining AI Predictions using Large Language Models
Alexandra Zytek
Sara Pido
Sarah Alnegheimish
Laure Berti-Equille
K. Veeramachaneni
121
1
0
06 Dec 2024
Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
S. Ramprasad
Byron C. Wallace
LLMAG, HILM
146
3
0
25 Nov 2024
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Reshmi Ghosh
Tianyi Yao
Lizzy Chen
Sadid Hasan
Tianwei Chen
Dario Bernal
Huitian Jiao
H M Sajjad Hossain
ELM
117
0
0
25 Nov 2024
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao
G. Xu
Zhe Wang
Arman Cohan
99
6
0
07 Nov 2024
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Yingzi Ma
Jiongxiao Wang
Fei Wang
Siyuan Ma
Jiazhao Li
...
B. Li
Yejin Choi
Mengzhao Chen
Chaowei Xiao
MU
131
10
0
05 Nov 2024
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM, LRM
184
0
0
03 Nov 2024
Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
Do Xuan Long
Duong Ngoc Yen
Anh Tuan Luu
Kenji Kawaguchi
Min-Yen Kan
Nancy F. Chen
KELM, ELM, LRM
117
7
0
01 Nov 2024
On Positional Bias of Faithfulness for Long-form Summarization
David Wan
Jesse Vig
Joey Tianyi Zhou
Shafiq Joty
HILM
100
8
0
31 Oct 2024
CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
Zhihao Liu
Chenhui Hu
ALM, ELM
75
1
0
29 Oct 2024
Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts
Sumit Asthana
Hannah Rashkin
Elizabeth Clark
Fantine Huot
Mirella Lapata
76
1
0
28 Oct 2024
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman
Yoonjoo Lee
Aakanksha Naik
Pao Siangliulue
Raymond Fok
Juho Kim
Daniel S. Weld
Joseph Chee Chang
Kyle Lo
LMTD
146
4
0
25 Oct 2024
An Auditing Test To Detect Behavioral Shift in Language Models
Leo Richter
Xuanli He
Pasquale Minervini
Matt J. Kusner
95
0
0
25 Oct 2024
Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Margaret Kroll
Kelsey Kraus
28
2
0
23 Oct 2024
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
Melissa Roemmele
Andrew S. Gordon
62
2
0
18 Oct 2024
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
Xiaonan Jing
Srinivas Billa
Danny Godbout
HILM
125
0
0
16 Oct 2024
Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
Yi-Fan Lu
Xian-Ling Mao
Tian Lan
Heyan Huang
Xiaoyan Gao
85
0
0
12 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
138
2
0
11 Oct 2024
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui
Yufei He
Zifeng Ding
Bryan Hooi
HILM, RALM, ELM
150
10
0
10 Oct 2024
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye
Yanbo Wang
Yue Huang
Dongping Chen
Qihui Zhang
...
Werner Geyer
Chao Huang
Pin-Yu Chen
Nitesh Chawla
Xiangliang Zhang
ELM
128
78
0
03 Oct 2024
AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference
Yang Han
Yiming Wang
Rui Wang
Lu Chen
Kai Yu
AI4TS, ALM
57
2
0
01 Oct 2024
A Scalable Data-Driven Framework for Systematic Analysis of SEC 10-K Filings Using Large Language Models
Syed Affan Daimi
Asma Iqbal
64
1
0
26 Sep 2024
AXCEL: Automated eXplainable Consistency Evaluation using LLMs
P Aditya Sreekar
Sahil Verma
Suransh Chopra
Sarik Ghazarian
Abhishek Persad
Narayanan Sadagopan
LRM
42
1
0
25 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
109
13
0
23 Sep 2024
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Shashidhar Reddy Javaji
Zining Zhu
ELM, ALM
63
1
0
19 Sep 2024
CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization
Ziwei Gong
Lin Ai
Harshsaiprasad Deshpande
Alexander Johnson
Emmy Phung
Zehui Wu
Ahmad Emami
Julia Hirschberg
105
2
0
17 Sep 2024
A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
Yuya Fujisaki
Shiro Takagi
Hideki Asoh
Wataru Kumagai
69
0
0
10 Sep 2024
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller
António Loison
Bilel Omrani
Gautier Viaud
RALM, ELM
108
2
0
10 Sep 2024
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Zhi Chen
Qiguang Chen
Libo Qin
Qipeng Guo
Haijun Lv
Yicheng Zou
Wanxiang Che
Hang Yan
Kai Chen
Dahua Lin
SyDa
126
4
0
03 Sep 2024
XG-NID: Dual-Modality Network Intrusion Detection using a Heterogeneous Graph Neural Network and Large Language Model
Yasir Ali Farrukh
S. Wali
I. Khan
Nathaniel D. Bastian
468
3
0
27 Aug 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
130
7
0
26 Aug 2024
DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang
Jiayi Yuan
Yu-Neng Chuang
Zhuoer Wang
Yingchi Liu
Mark Cusick
Param Kulkarni
Zhengping Ji
Yasser Ibrahim
Xia Hu
LM&MA, ELM
123
4
0
25 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM, ELM
194
32
0
23 Aug 2024
VERA: Validation and Evaluation of Retrieval-Augmented Systems
Tianyu Ding
Adi Banerjee
Laurent Mombaerts
Yunhong Li
Tarik Borogovac
Juan Pablo De la Cruz Weinstein
65
2
0
16 Aug 2024
Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
Nicy Scaria
S. Chenna
Deepak N. Subramani
ELM, AI4Ed
73
12
0
08 Aug 2024
Self-Taught Evaluators
Tianlu Wang
Ilia Kulikov
O. Yu. Golovneva
Ping Yu
Weizhe Yuan
Jane Dwivedi-Yu
Richard Yuanzhe Pang
Maryam Fazel-Zarandi
Jason Weston
Xian Li
ALM, LRM
81
27
0
05 Aug 2024
Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection
Gaetan Lopez Latouche
M. Carbonneau
Ben Swanson
83
0
0
16 Jul 2024
The Oscars of AI Theater: A Survey on Role-Playing with Language Models
Nuo Chen
Yan Wang
Yang Deng
Jia Li
120
21
0
16 Jul 2024
DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
Anni Zou
Wenhao Yu
Hongming Zhang
Kaixin Ma
Deng Cai
Zhuosheng Zhang
Hai Zhao
Dong Yu
78
9
0
15 Jul 2024
Enhancing Emotion Prediction in News Headlines: Insights from ChatGPT and Seq2Seq Models for Free-Text Generation
Ge Gao
Jongin Kim
Sejin Paik
Ekaterina Novozhilova
Yi Liu
Sarah Bonna
Margrit Betke
Derry Wijaya
89
1
0
14 Jul 2024
Virtual Personas for Language Models via an Anthology of Backstories
Suhong Moon
Marwa Abdulhai
Minwoo Kang
Joseph Suh
Widyadewi Soedarmadji
Eran Kohen Behar
David M. Chan
88
15
0
09 Jul 2024
Source Code Summarization in the Era of Large Language Models
Weisong Sun
Yun Miao
Yuekang Li
Hongyu Zhang
Chunrong Fang
Yi Liu
Gelei Deng
Yang Liu
Zhenyu Chen
ELM
144
18
0
09 Jul 2024