ResearchTrend.AI
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM, ALM, LM&MA
ArXiv (abs) | PDF | HTML | GitHub (344★)

Papers citing "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

50 / 264 papers shown
Title
Cost-Saving LLM Cascades with Early Abstention
Michael J. Zellinger
Rex Liu
Matt Thomson
155
2
0
13 Feb 2025
Improve LLM-based Automatic Essay Scoring with Linguistic Features
Zhaoyi Joey Hou
Alejandro Ciuba
Xiang Lorraine Li
108
1
0
13 Feb 2025
Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights
Ahilan Ayyachamy Nadar Ponnusamy
150
1
0
11 Feb 2025
Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg
Gen Suzuki
Wei Liu
Murat Sensoy
ALM
146
0
0
07 Feb 2025
Teaching Large Language Models Number-Focused Headline Generation With Key Element Rationales
Zhen Qian
Xiuzhen Zhang
Xiaofei Xu
Xiwei Xu
LRM
74
0
0
05 Feb 2025
LLM-based Affective Text Generation Quality Based on Different Quantization Values
Yarik Menchaca Resendiz
Roman Klinger
MQ
263
1
0
31 Jan 2025
Breaking the Stigma! Unobtrusively Probe Symptoms in Depression Disorder Diagnosis Dialogue
Jieming Cao
Chen Huang
Y. Zhang
Ruibo Deng
Jincheng Zhang
Wenqiang Lei
100
0
0
28 Jan 2025
SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval
Hossein A. Rahmani
Xi Wang
Emine Yilmaz
Nick Craswell
Bhaskar Mitra
Paul Thomas
128
7
0
28 Jan 2025
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim
Kyungjae Lee
Y. Jang
Ji Yong Cho
Gangwoo Kim
Minseok Cho
Moontae Lee
290
1
0
28 Jan 2025
Federated Retrieval Augmented Generation for Multi-Product Question Answering
Parshin Shojaee
Sai Sree Harsha
Dan Luo
Akash Maharaj
Tong Yu
Yunyao Li
105
4
0
28 Jan 2025
Learning to Summarize from LLM-generated Feedback
Hwanjun Song
Taewon Yun
Yuho Lee
Jihwan Oh
Gihun Lee
Jason (Jinglun) Cai
Hang Su
225
10
0
28 Jan 2025
Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression
Kai Yoshida
M. Mizukami
Seiya Kawano
Canasai Kruengkrai
Hiroaki Sugiyama
Koichiro Yoshino
ALM, OffRL
134
1
0
28 Jan 2025
Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan
Jongwoo Ko
Lei Xu
Mahsa Elyasi
Ling Liu
S. Bodapati
Dan Roth
129
6
0
28 Jan 2025
Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
Mingqi Gao
Xinyu Hu
Li Lin
Xiaojun Wan
78
2
0
28 Jan 2025
Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction
Yooseop Lee
Suin Kim
Yohan Jo
AI4Ed
154
2
0
21 Jan 2025
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Yinhong Liu
Han Zhou
Zhijiang Guo
Ehsan Shareghi
Ivan Vulić
Anna Korhonen
Nigel Collier
ALM
219
83
0
20 Jan 2025
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao
Yuancheng Wang
Xiaohai Tian
Dekun Chen
Jing Zhang
Lu Lu
Yansen Wang
Haizhou Li
Zhikai Wu
AuLLM
183
25
0
17 Jan 2025
PASS: Presentation Automation for Slide Generation and Speech
Tushar Aggarwal
Aarohi Bhand
110
1
0
17 Jan 2025
Measuring the Robustness of Reference-Free Dialogue Evaluation Systems
Justin Vasselli
Adam Nohejl
Taro Watanabe
AAML
87
0
0
12 Jan 2025
CodEv: An Automated Grading Framework Leveraging Large Language Models for Consistent and Constructive Feedback
En-Qi Tseng
Pei-Cing Huang
Chan Hsu
Peng-Yi Wu
Chan-Tung Ku
Yihuang Kang
105
1
0
10 Jan 2025
Multi-LLM Collaborative Caption Generation in Scientific Documents
Jaeyoung Kim
J. B. Lee
Hong-Jun Choi
Ting-Yao Hsu
Chieh-Yang Huang
...
Ryan Rossi
Tong Yu
C. Lee Giles
Ting-Hao 'Kenneth' Huang
S. Choi
105
3
0
05 Jan 2025
CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions
Mourad Heddaya
Kyle MacMillan
Anup Malani
Hongyuan Mei
Chenhao Tan
AILaw, ELM
76
2
0
03 Jan 2025
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li
Teerth Patel
Xinya Du
LLMAG, ALM
189
102
0
03 Jan 2025
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
Sheikh Shafayat
Dongkeun Yoon
Woori Jang
Jiwoo Choi
Alice Oh
Seohyon Jung
219
1
0
03 Jan 2025
LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Helia Hashemi
J. Eisner
Corby Rosset
Benjamin Van Durme
Chris Kedzie
145
6
0
03 Jan 2025
Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Dong Yuan
Eti Rastogi
Fen Zhao
Sagar Goyal
Gautam Naik
Sree Prasanna Rajagopal
70
0
0
31 Dec 2024
Revisiting In-Context Learning with Long Context Language Models
Jinheon Baek
Sun Jae Lee
Prakhar Gupta
Geunseob Oh
Siddharth Dalmia
674
3
0
22 Dec 2024
Towards Automatic Evaluation for Image Transcreation
Simran Khanuja
Vivek Iyer
Claire He
Graham Neubig
ViT
153
2
0
18 Dec 2024
EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents
Mengna Zhu
Kaisheng Zeng
Mao Wang
Kaiming Xiao
Lei Hou
Hongbin Huang
Juanzi Li
528
1
0
16 Dec 2024
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan
Neemesh Yadav
Sarah Masud
Md. Shad Akhtar
169
0
0
16 Dec 2024
Leveraging Large Language Models for Active Merchant Non-player Characters
Byungjun Kim
Minju Kim
Dayeon Seo
Bugeun Kim
230
0
0
15 Dec 2024
Can the Rookies Cut the Tough Cookie? Exploring the Use of LLMs for SQL Equivalence Checking
Rajat Singh
Srikanta J. Bedathur
171
3
0
07 Dec 2024
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
Xueqing Wu
Yuheng Ding
Bingxuan Li
Pan Lu
Da Yin
Kai-Wei Chang
Nanyun Peng
LRM
159
4
0
03 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM, AILaw
398
112
0
25 Nov 2024
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Bo Yang
Qingping Yang
Runtao Liu
LRM, ReLM, ELM, AIMat
155
1
0
11 Nov 2024
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Guan-Ting Lin
Prashanth Gurunath Shivakumar
Aditya Gourav
Yile Gu
Ankur Gandhe
Hung-yi Lee
I. Bulyko
124
9
0
04 Nov 2024
Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada
Claire Stevenson
Lonneke van der Plas
LM&MA, LRM
142
5
0
04 Nov 2024
Graph-based Confidence Calibration for Large Language Models
Yukun Li
Sijia Wang
Lifu Huang
Li-Ping Liu
UQCV
198
2
0
03 Nov 2024
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM, LRM
188
0
0
03 Nov 2024
Comparison-based Active Preference Learning for Multi-dimensional Personalization
Minhyeon Oh
Seungjoon Lee
Jungseul Ok
72
1
0
01 Nov 2024
On Positional Bias of Faithfulness for Long-form Summarization
David Wan
Jesse Vig
Joey Tianyi Zhou
Shafiq Joty
HILM
112
8
0
31 Oct 2024
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu
Bowen Shi
Avi Caciularu
Idan Szpektor
Arman Cohan
162
4
0
30 Oct 2024
SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script
Eunwon Kim
Chanho Park
Buru Chang
80
2
0
28 Oct 2024
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
Jaechang Kim
Jinmin Goh
Inseok Hwang
Jaewoong Cho
Jungseul Ok
ELM
93
2
0
28 Oct 2024
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
Dongryeol Lee
Yerin Hwang
Yongil Kim
Joonsuk Park
Kyomin Jung
ELM
166
10
0
28 Oct 2024
An Auditing Test To Detect Behavioral Shift in Language Models
Leo Richter
Xuanli He
Pasquale Minervini
Matt J. Kusner
97
0
0
25 Oct 2024
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
Taewhoo Lee
Chanwoong Yoon
Kyochul Jang
Donghyeon Lee
Minju Song
Hyunjae Kim
Jaewoo Kang
ELM
89
1
0
22 Oct 2024
Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Hamed Fayyaz
Raphael Poulain
Rahmatollah Beheshti
109
2
0
18 Oct 2024
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Xiongtao Zhou
Jie He
Lanyu Chen
Jingyu Li
Haojing Chen
Víctor Gutiérrez-Basulto
Jeff Z. Pan
Ningyu Zhang
LRM
193
2
0
18 Oct 2024
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
F. S. Bao
Miaoran Li
Renyi Qu
Ge Luo
Erana Wan
...
Ruixuan Tu
Chenyu Xu
Matthew Gonzales
Ofer Mendelevitch
Amin Ahmad
VLM, HILM
93
7
0
17 Oct 2024