Is ChatGPT a Good NLG Evaluator? A Preliminary Study

7 March 2023
Jiaan Wang
Yunlong Liang
Fandong Meng
Zengkui Sun
Haoxiang Shi
Zhixu Li
Jinan Xu
Jianfeng Qu
Jie Zhou
    LM&MAELMALMAI4MH

Papers citing "Is ChatGPT a Good NLG Evaluator? A Preliminary Study"

50 / 307 papers shown
Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version)
Yiming Zhu
Peixian Zhang
Ehsan-ul Haq
Pan Hui
Gareth Tyson
ALMAI4MH
100
0
0
08 Jul 2024
On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks
Zesen Liu
Tianshuo Cong
Xinlei He
Qi Li
AAMLWaLM
115
1
0
05 Jul 2024
EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
Hannes Kunstmann
J. Ollier
Joel Persson
F. Wangenheim
78
0
0
05 Jul 2024
Waterfall: Framework for Robust and Scalable Text Watermarking
Gregory Kang Ruey Lau
Xinyuan Niu
Hieu Dao
Jiangwei Chen
Chuan-Sheng Foo
Bryan Kian Hsiang Low
WaLM
82
6
0
05 Jul 2024
Human-Centered Design Recommendations for LLM-as-a-Judge
Qian Pan
Zahra Ashktorab
Michael Desmond
Martin Santillan Cooper
James M. Johnson
Rahul Nair
Elizabeth M. Daly
Werner Geyer
ELMALM
67
19
0
03 Jul 2024
Hybrid RAG-empowered Multi-modal LLM for Secure Healthcare Data Management: A Diffusion-based Contract Theory Approach
Cheng Su
Jinbo Wen
Jiawen Kang
Yonghua Wang
Hudan Pan
M. S. Hossain
MedIm
45
0
0
01 Jul 2024
FineSurE: Fine-grained Summarization Evaluation using LLMs
Hwanjun Song
Hang Su
Igor Shalyminov
Jason (Jinglun) Cai
Saab Mansour
HILM
85
36
0
01 Jul 2024
Free-text Rationale Generation under Readability Level Control
Yi-Sheng Hsu
Nils Feldhus
Sherzod Hakimov
112
2
0
01 Jul 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
119
39
0
26 Jun 2024
Themis: Towards Flexible and Interpretable NLG Evaluation
Xinyu Hu
Li Lin
Mingqi Gao
Xunjian Yin
Xiaojun Wan
ELM
91
8
0
26 Jun 2024
ConvoCache: Smart Re-Use of Chatbot Responses
Conor Atkins
Ian D. Wood
M. Kâafar
Hassan Jameel Asghar
Nardine Basta
Michal Kepkowski
99
0
0
26 Jun 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALMELM
179
88
0
26 Jun 2024
Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
Hanqi Yan
Yanzheng Xiang
Guangyi Chen
Yifei Wang
Lin Gui
Yulan He
117
5
0
25 Jun 2024
CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
Tao Feng
Zhuang Li
Xiaoxi Kang
Gholamreza Haffari
67
1
0
25 Jun 2024
C-LLM: Learn to Check Chinese Spelling Errors Character by Character
Kunting Li
Yong Hu
Liang He
Fandong Meng
Jie Zhou
89
9
0
24 Jun 2024
A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
I. Zubiaga
A. Soroa
Rodrigo Agerri
74
6
0
21 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
107
16
0
19 Jun 2024
Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
Alex Chandler
Devesh Surve
Hui Su
HILMUQCV
58
1
0
18 Jun 2024
A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4
Ming Gu
Yan Yang
43
0
0
17 Jun 2024
AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning
Jun Gao
Qian Qiao
Ziqiang Cao
Zili Wang
Wenjie Li
80
3
0
11 Jun 2024
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
Zonghao Ying
Aishan Liu
Xianglong Liu
Dacheng Tao
124
25
0
10 Jun 2024
On Subjective Uncertainty Quantification and Calibration in Natural Language Generation
Ziyu Wang
Chris Holmes
UQLM
162
7
0
07 Jun 2024
Large Language Models as Evaluators for Recommendation Explanations
Xiaoyu Zhang
Yishan Li
Jiayin Wang
Bowen Sun
Weizhi Ma
Peijie Sun
Min Zhang
LRMELM
100
14
0
05 Jun 2024
Item-Language Model for Conversational Recommendation
Li Yang
Anushya Subbiah
Hardik Patel
Judith Yue Li
Yanwei Song
Reza Mirghaderi
Vikram Aggarwal
Qifan Wang
KELM
94
5
0
05 Jun 2024
XRec: Large Language Models for Explainable Recommendation
Qiyao Ma
Xubin Ren
Chao Huang
LRM
87
23
0
04 Jun 2024
Guiding ChatGPT to Generate Salient Domain Summaries
Jun Gao
Ziqiang Cao
Shaoyao Huang
Luozheng Qin
Chunhui Ai
105
1
0
03 Jun 2024
Multi-Dimensional Optimization for Text Summarization via Reinforcement Learning
Sangwon Ryu
Heejin Do
Yunsu Kim
Gary Geunbae Lee
Jungseul Ok
97
3
0
01 Jun 2024
Towards Rationality in Language and Multimodal Agents: A Survey
Bowen Jiang
Yangxinyu Xie
Xiaomeng Wang
Yuan Yuan
Weijie J. Su
Camillo J. Taylor
Tanwi Mallick
LLMAG
66
6
0
01 Jun 2024
Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals
Phillip Howard
Kathleen C. Fraser
Anahita Bhiwandiwalla
S. Kiritchenko
146
13
0
30 May 2024
A Full-duplex Speech Dialogue Scheme Based On Large Language Models
Peng Wang
Songshuo Lu
Yaohua Tang
Sijie Yan
Yuanjun Xiong
Wei Xia
AuLLM
83
16
0
29 May 2024
SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself
Jun Gao
Ziqiang Cao
Wenjie Li
63
7
0
27 May 2024
CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
Chenhao Zhang
Renhao Li
Minghuan Tan
Min Yang
Jingwei Zhu
Di Yang
Jiahao Zhao
Guancheng Ye
Chengming Li
Xiping Hu
133
29
0
26 May 2024
GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases
Zhizheng Wang
Qiao Jin
Chih-Hsuan Wei
Shubo Tian
Po-Ting Lai
Qingqing Zhu
Chi-Ping Day
Christina Ross
Zhiyong Lu
LLMAG
89
9
0
25 May 2024
SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation
Kun Zhao
Bohao Yang
Chen Tang
Chenghua Lin
Liang Zhan
79
5
0
24 May 2024
Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs
Asim Ersoy
O. T. Yildiz
54
0
0
24 May 2024
CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems
Abbas Ghaddar
David Alfonso-Hermelo
Philippe Langlais
Mehdi Rezagholizadeh
Boxing Chen
Prasanna Parthasarathi
69
0
0
24 May 2024
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Zhangyue Yin
Qiushi Sun
Qipeng Guo
Zhiyuan Zeng
Xiaonan Li
...
Qinyuan Cheng
Ding Wang
Xiaofeng Mou
Xipeng Qiu
XuanJing Huang
LRM
96
4
0
21 May 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim
Gary Geunbae Lee
AAML
124
3
0
21 May 2024
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models
Haoxiang Shi
Jiaan Wang
Jiarong Xu
Cen Wang
Tetsuya Sakai
LMTD
66
0
0
20 May 2024
Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model
Chen Huang
Yang Deng
Wenqiang Lei
Jiancheng Lv
Ido Dagan
77
4
0
20 May 2024
Language Models can Evaluate Themselves via Probability Discrepancy
Tingyu Xia
Bowen Yu
Yuan Wu
Yi-Ju Chang
Chang Zhou
ELM
112
5
0
17 May 2024
DEBATE: Devil's Advocate-Based Assessment and Text Evaluation
Alex G. Kim
Keonwoo Kim
Sangwon Yoon
ELM
57
7
0
16 May 2024
LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play
Li-Chun Lu
Shou-Jen Chen
Tsung-Min Pai
Chan-Hung Yu
Hung-yi Lee
Shao-Hua Sun
LLMAG
98
50
0
10 May 2024
Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
Adian Liusie
Vatsal Raina
Yassir Fathullah
Mark Gales
104
12
0
09 May 2024
Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large
Jussi S. Jauhiainen
Agustín Garagorry Guerra
91
5
0
08 May 2024
MIDGARD: Self-Consistency Using Minimum Description Length for Structured Commonsense Reasoning
Inderjeet Nair
Lu Wang
LRM
49
1
0
08 May 2024
Assessing and Verifying Task Utility in LLM-Powered Applications
Negar Arabzadeh
Siqing Huo
Nikhil Mehta
Qingyun Wu
Chi Wang
Ahmed Hassan Awadallah
Charles L. A. Clarke
Julia Kiseleva
87
12
0
03 May 2024
Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg
Dimitris Alikaniotis
Yoshi Suhara
ALM
123
66
0
02 May 2024
CEval: A Benchmark for Evaluating Counterfactual Text Generation
Van Bach Nguyen
Jörg Schlötterer
Christin Seifert
99
7
0
26 Apr 2024
FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom
Yuanqin He
Yan Kang
Lixin Fan
Qiang Yang
62
3
0
18 Apr 2024