ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Is ChatGPT a Good NLG Evaluator? A Preliminary Study


7 March 2023
Jiaan Wang
Yunlong Liang
Fandong Meng
Zengkui Sun
Haoxiang Shi
Zhixu Li
Jinan Xu
Jianfeng Qu
Jie Zhou
    LM&MA
    ELM
    ALM
    AI4MH

Papers citing "Is ChatGPT a Good NLG Evaluator? A Preliminary Study"

50 / 288 papers shown
Title
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman
Yoonjoo Lee
Aakanksha Naik
Pao Siangliulue
Raymond Fok
Juho Kim
Daniel S. Weld
Joseph Chee Chang
Kyle Lo
LMTD
25
3
0
25 Oct 2024
Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Margaret Kroll
Kelsey Kraus
19
2
0
23 Oct 2024
From Test-Taking to Test-Making: Examining LLM Authoring of Commonsense Assessment Items
Melissa Roemmele
Andrew S. Gordon
32
1
0
18 Oct 2024
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
Xiaonan Jing
Srinivas Billa
Danny Godbout
HILM
42
0
0
16 Oct 2024
Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
Yi-Fan Lu
Xian-Ling Mao
Tian Lan
Heyan Huang
Xiaoyan Gao
55
0
0
12 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
79
2
0
11 Oct 2024
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye
Yanbo Wang
Yue Huang
Dongping Chen
Qihui Zhang
...
Werner Geyer
Chao Huang
Pin-Yu Chen
Nitesh V. Chawla
Xiangliang Zhang
ELM
40
45
0
03 Oct 2024
AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference
Yang Han
Yiming Wang
Rui-cang Wang
Lu Chen
Kai Yu
AI4TS
ALM
24
1
0
01 Oct 2024
A Scalable Data-Driven Framework for Systematic Analysis of SEC 10-K Filings Using Large Language Models
Syed Affan Daimi
Asma Iqbal
36
1
0
26 Sep 2024
AXCEL: Automated eXplainable Consistency Evaluation using LLMs
P Aditya Sreekar
Sahil Verma
Suransh Chopra
Sarik Ghazarian
Abhishek Persad
Narayanan Sadagopan
LRM
33
0
0
25 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
39
12
0
23 Sep 2024
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Shashidhar Reddy Javaji
Zining Zhu
ELM
ALM
39
0
0
19 Sep 2024
CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization
Ziwei Gong
Lin Ai
Harshsaiprasad Deshpande
Alexander Johnson
Emmy Phung
Zehui Wu
Ahmad Emami
Julia Hirschberg
41
2
0
17 Sep 2024
A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
Yuya Fujisaki
Shiro Takagi
Hideki Asoh
Wataru Kumagai
23
0
0
10 Sep 2024
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller
António Loison
Bilel Omrani
Gautier Viaud
RALM
ELM
38
1
0
10 Sep 2024
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Zhi Chen
Qiguang Chen
Libo Qin
Qipeng Guo
Haijun Lv
Yicheng Zou
Wanxiang Che
Hang Yan
K. Chen
Dahua Lin
SyDa
53
4
0
03 Sep 2024
XG-NID: Dual-Modality Network Intrusion Detection using a Heterogeneous Graph Neural Network and Large Language Model
Yasir Ali Farrukh
S. Wali
I. Khan
Nathaniel D. Bastian
147
2
0
27 Aug 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
44
5
0
26 Aug 2024
DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang
Jiayi Yuan
Yu-Neng Chuang
Zhuoer Wang
Yingchi Liu
Mark Cusick
Param Kulkarni
Zhengping Ji
Yasser Ibrahim
Xia Hu
LM&MA
ELM
49
3
0
25 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
64
23
0
23 Aug 2024
VERA: Validation and Evaluation of Retrieval-Augmented Systems
Tianyu Ding
Adi Banerjee
Laurent Mombaerts
Yunhong Li
Tarik Borogovac
Juan Pablo De la Cruz Weinstein
29
2
0
16 Aug 2024
Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
Nicy Scaria
S. Chenna
Deepak N. Subramani
ELM
AI4Ed
30
7
0
08 Aug 2024
Self-Taught Evaluators
Tianlu Wang
Ilia Kulikov
O. Yu. Golovneva
Ping Yu
Weizhe Yuan
Jane Dwivedi-Yu
Richard Yuanzhe Pang
Maryam Fazel-Zarandi
Jason Weston
Xian Li
ALM
LRM
29
22
0
05 Aug 2024
Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection
Gaetan Lopez Latouche
M. Carbonneau
Ben Swanson
27
0
0
16 Jul 2024
The Oscars of AI Theater: A Survey on Role-Playing with Language Models
Nuo Chen
Yan Wang
Yang Deng
Jia Li
35
15
0
16 Jul 2024
DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
Anni Zou
Wenhao Yu
Hongming Zhang
Kaixin Ma
Deng Cai
Zhuosheng Zhang
Hai Zhao
Dong Yu
41
6
0
15 Jul 2024
Enhancing Emotion Prediction in News Headlines: Insights from ChatGPT and Seq2Seq Models for Free-Text Generation
Ge Gao
Jongin Kim
Sejin Paik
Ekaterina Novozhilova
Yi Liu
Sarah Bonna
Margrit Betke
Derry Wijaya
41
0
0
14 Jul 2024
Virtual Personas for Language Models via an Anthology of Backstories
Suhong Moon
Marwa Abdulhai
Minwoo Kang
Joseph Suh
Widyadewi Soedarmadji
Eran Kohen Behar
David M. Chan
49
11
0
09 Jul 2024
Source Code Summarization in the Era of Large Language Models
Dongrui Liu
Yun Miao
Yuekang Li
Hongyu Zhang
Chunrong Fang
Yi Liu
Gelei Deng
Yang Liu
Zhenyu Chen
ELM
55
14
0
09 Jul 2024
Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version)
Yiming Zhu
Peixian Zhang
Ehsan-ul Haq
Pan Hui
Gareth Tyson
ALM
AI4MH
47
0
0
08 Jul 2024
On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks
Zesen Liu
Tianshuo Cong
Xinlei He
Qi Li
AAML
WaLM
50
1
0
05 Jul 2024
EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
Hannes Kunstmann
J. Ollier
Joel Persson
F. Wangenheim
37
0
0
05 Jul 2024
Waterfall: Framework for Robust and Scalable Text Watermarking
Gregory Kang Ruey Lau
Xinyuan Niu
Hieu Dao
Jiangwei Chen
Chuan-Sheng Foo
Bryan Kian Hsiang Low
WaLM
41
6
0
05 Jul 2024
Human-Centered Design Recommendations for LLM-as-a-Judge
Qian Pan
Zahra Ashktorab
Michael Desmond
Martin Santillan Cooper
James M. Johnson
Rahul Nair
Elizabeth M. Daly
Werner Geyer
ELM
ALM
39
17
0
03 Jul 2024
Free-text Rationale Generation under Readability Level Control
Yi-Sheng Hsu
Nils Feldhus
Sherzod Hakimov
40
0
0
01 Jul 2024
Hybrid RAG-empowered Multi-modal LLM for Secure Healthcare Data Management: A Diffusion-based Contract Theory Approach
Cheng Su
Jinbo Wen
Jiawen Kang
Yonghua Wang
Hudan Pan
M. S. Hossain
MedIm
21
0
0
01 Jul 2024
FineSurE: Fine-grained Summarization Evaluation using LLMs
Hwanjun Song
Hang Su
Igor Shalyminov
Jason (Jinglun) Cai
Saab Mansour
HILM
41
31
0
01 Jul 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
B. Ermiş
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
40
28
0
26 Jun 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM
ELM
54
62
0
26 Jun 2024
Themis: Towards Flexible and Interpretable NLG Evaluation
Xinyu Hu
Li Lin
Mingqi Gao
Xunjian Yin
Xiaojun Wan
ELM
34
6
0
26 Jun 2024
ConvoCache: Smart Re-Use of Chatbot Responses
Conor Atkins
Ian D. Wood
M. Kâafar
Hassan Jameel Asghar
Nardine Basta
Michal Kepkowski
33
0
0
26 Jun 2024
Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
Hanqi Yan
Yanzheng Xiang
Guangyi Chen
Yifei Wang
Lin Gui
Yulan He
41
5
0
25 Jun 2024
CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
Tao Feng
Lizhen Qu
Xiaoxi Kang
Gholamreza Haffari
38
1
0
25 Jun 2024
C-LLM: Learn to Check Chinese Spelling Errors Character by Character
Kunting Li
Yong Hu
Liang He
Fandong Meng
Jie Zhou
37
7
0
24 Jun 2024
A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
I. Zubiaga
A. Soroa
Rodrigo Agerri
39
4
0
21 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
42
11
0
19 Jun 2024
Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
Alex Chandler
Devesh Surve
Hui Su
HILM
UQCV
31
1
0
18 Jun 2024
A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4
Ming Gu
Yan Yang
21
0
0
17 Jun 2024
AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning
Jun Gao
Qian Qiao
Ziqiang Cao
Zili Wang
Wenjie Li
34
3
0
11 Jun 2024
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
Zonghao Ying
Aishan Liu
Xianglong Liu
Dacheng Tao
62
16
0
10 Jun 2024