G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM, ALM, LM&MA
ArXiv (abs), PDF, HTML, GitHub (344★)

Papers citing "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

50 / 264 papers shown
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Nandan Thakur
Suleman Kazi
Ge Luo
Jimmy J. Lin
Amin Ahmad
VLM, RALM
210
7
0
17 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem
Pouya Pezeshkpour
Hayate Iso
Seiji Maekawa
Nikita Bhutani
Estevam R. Hruschka
HILM
142
3
0
17 Oct 2024
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Yuzhe Yang
Yifei Zhang
Yan Hu
Y. Guo
Ruoli Gan
...
Haining Wang
Qianqian Xie
Jimin Huang
Honghai Yu
Benyou Wang
ELM, AIFin
115
2
0
17 Oct 2024
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
Xiaonan Jing
Srinivas Billa
Danny Godbout
HILM
131
0
0
16 Oct 2024
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data
Seiji Maekawa
Hayate Iso
Nikita Bhutani
RALM
218
2
0
15 Oct 2024
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Haotian Tang
Yecheng Wu
Shang Yang
Enze Xie
Junsong Chen
Junyu Chen
Zhuoyang Zhang
Han Cai
Yaojie Lu
Song Han
222
48
0
14 Oct 2024
Language Model Preference Evaluation with Multiple Weak Evaluators
Zhengyu Hu
Jieyu Zhang
Zhihan Xiong
Alexander Ratner
Hui Xiong
Ranjay Krishna
184
5
0
14 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
155
1
0
14 Oct 2024
Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
Yi-Fan Lu
Xian-Ling Mao
Tian Lan
Heyan Huang
Xiaoyan Gao
87
0
0
12 Oct 2024
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
H. Xia
Zhengbang Yang
Junbo Zou
Rhys Tracy
Yuqing Wang
...
Xun Shao
Zhuoqing Xie
Yuan-fang Wang
Weining Shen
Hanjie Chen
ReLM, LRM, ELM
127
4
0
11 Oct 2024
Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang
Junhong Wu
Chen Wang
Chengqing Zong
J.N. Zhang
ALM, LRM
219
7
0
11 Oct 2024
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui
Yufei He
Zifeng Ding
Bryan Hooi
HILM, RALM, ELM
150
10
0
10 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min Lin
89
13
0
09 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang
Yufei Wang
Tiezheng Yu
Yuxin Jiang
Chuhan Wu
...
Xin Jiang
Lifeng Shang
Ruiming Tang
Fuyuan Lyu
Chen Ma
131
7
0
07 Oct 2024
MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences
Genta Indra Winata
David Anugraha
Lucky Susanto
Garry Kuwanto
Derry Wijaya
181
11
0
03 Oct 2024
CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering
Yike Wu
Yi Huang
Nan Hu
Yuncheng Hua
Guilin Qi
Jiaoyan Chen
Jeff Z. Pan
120
9
0
29 Sep 2024
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Shashidhar Reddy Javaji
Zining Zhu
ELM, ALM
67
1
0
19 Sep 2024
Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling
Xinyue Fang
Zhen Huang
Zhiliang Tian
Minghui Fang
Ziyi Pan
Quntian Fang
Zhihua Wen
Hengyue Pan
Dongsheng Li
HILM
142
2
0
17 Sep 2024
Can Unconfident LLM Annotations Be Used for Confident Conclusions?
Kristina Gligorić
Tijana Zrnic
Cinoo Lee
Emmanuel J. Candès
Dan Jurafsky
189
12
0
27 Aug 2024
Poor-Supervised Evaluation for SuperLLM via Mutual Consistency
Peiwen Yuan
Shaoxiong Feng
Yiwei Li
Xinglin Wang
Boyuan Pan
Heda Wang
Yao Hu
Kan Li
77
1
0
25 Aug 2024
DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang
Jiayi Yuan
Yu-Neng Chuang
Zhuoer Wang
Yingchi Liu
Mark Cusick
Param Kulkarni
Zhengping Ji
Yasser Ibrahim
Xia Hu
LM&MA, ELM
125
4
0
25 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM, ELM
199
32
0
23 Aug 2024
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
Sher Badshah
Hassan Sajjad
ELM
100
14
0
17 Aug 2024
DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts
Mohammed Saidul Islam
Md Tahmid Rahman Laskar
Md. Rizwan Parvez
Enamul Hoque
Shafiq Joty
DiffM
97
7
0
09 Aug 2024
Retrieval-augmented code completion for local projects using large language models
Marko Hostnik
Marko Robnik-Sikonja
RALM
83
1
0
09 Aug 2024
Automated Review Generation Method Based on Large Language Models
Shican Wu
Xiao Ma
Dehui Luo
Lulu Li
Xiangcheng Shi
...
Ran Luo
Chunlei Pei
Zhijian Zhao
Zhi-Jian Zhao
Jinlong Gong
175
0
0
30 Jul 2024
PersonaGym: Evaluating Persona Agents and LLMs
Vinay Samuel
Henry Peng Zou
Yue Zhou
Shreyas Chaudhari
Ashwin Kalyan
Tanmay Rajpurohit
Ameet Deshpande
Karthik Narasimhan
Vishvak Murahari
LLMAG
121
31
0
25 Jul 2024
ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues
John Mendonça
Isabel Trancoso
A. Lavie
70
3
0
16 Jul 2024
GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework
Hannah Sansford
Nicholas Richardson
Hermina Petric Maretic
Juba Nait Saada
89
17
0
15 Jul 2024
Cohesive Conversations: Enhancing Authenticity in Multi-Agent Simulated Dialogues
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
LLMAG
88
5
0
13 Jul 2024
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Shraman Pramanick
Rama Chellappa
Subhashini Venugopalan
118
21
0
12 Jul 2024
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
Biqing Qi
Kaiyan Zhang
Kai Tian
Haoxiang Li
Zhang-Ren Chen
Sihang Zeng
Ermo Hua
Hu Jinfang
Bowen Zhou
LM&MA
127
18
0
12 Jul 2024
On LLM Wizards: Identifying Large Language Models' Behaviors for Wizard of Oz Experiments
Jingchao Fang
Nikos Aréchiga
Keiichi Namaoshi
N. Bravo
Candice L Hogan
David A. Shamma
73
5
0
10 Jul 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang
Linlu Qiu
Cheng-Yu Hsieh
Ranjay Krishna
Yoon Kim
James R. Glass
HILM
92
48
0
09 Jul 2024
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Gaurav Sahu
Abhay Puri
Juan A. Rodriguez
Alexandre Drouin
Perouz Taslakian
...
Christopher Pal
Nicolas Chapados
I. Laradji
Sai Rajeswar Mudumba
Issam Hadj Laradji
ELM
136
7
0
08 Jul 2024
Enhancing Hallucination Detection through Perturbation-Based Synthetic Data Generation in System Responses
Dongxu Zhang
Varun Gangal
B. Lattimer
Yi Yang
74
6
0
07 Jul 2024
Free-text Rationale Generation under Readability Level Control
Yi-Sheng Hsu
Nils Feldhus
Sherzod Hakimov
114
2
0
01 Jul 2024
When Search Engine Services meet Large Language Models: Visions and Challenges
Haoyi Xiong
Jiang Bian
Yuchen Li
Xuhong Li
Jundong Li
Shuaiqiang Wang
D. Yin
Sumi Helal
146
36
0
28 Jun 2024
Can Large Language Models Generate High-quality Patent Claims?
Lekang Jiang
Caiqi Zhang
Pascal A Scherz
Stephan Goetz
ELM
127
7
0
27 Jun 2024
ConvoCache: Smart Re-Use of Chatbot Responses
Conor Atkins
Ian D. Wood
M. Kâafar
Hassan Jameel Asghar
Nardine Basta
Michal Kepkowski
101
0
0
26 Jun 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM, ELM
193
88
0
26 Jun 2024
FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering
Tianchi Cai
Zhiwen Tan
Xierui Song
Tao Sun
Jiyan Jiang
Yunqi Xu
Yinger Zhang
Jinjie Gu
82
7
0
19 Jun 2024
AI-Assisted Human Evaluation of Machine Translation
Vilém Zouhar
Tom Kocmi
Mrinmaya Sachan
133
7
0
18 Jun 2024
Incentivizing Quality Text Generation via Statistical Contracts
Eden Saig
Ohad Einav
Inbal Talgam-Cohen
94
5
0
17 Jun 2024
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Tu Anh Dinh
Carlos Mullov
Leonard Barmann
Zhaolin Li
Danni Liu
...
Michael Beigl
Rainer Stiefelhagen
Carsten Dachsbacher
Klemens Bohm
Jan Niehues
ELM
91
12
0
14 Jun 2024
A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
117
10
0
14 Jun 2024
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang
Yingzhuo Yu
Jin Huang
Xingjian Zhang
Jiaqi Ma
130
1
0
11 Jun 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM, ALM, LM&MA
208
44
0
09 Jun 2024
Amortizing intractable inference in diffusion models for vision, language, and control
S. Venkatraman
Moksh Jain
Luca Scimeca
Minsu Kim
Marcin Sendera
...
Alexandre Adam
Jarrid Rector-Brooks
Yoshua Bengio
Glen Berseth
Nikolay Malkin
191
32
0
31 May 2024
Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals
Phillip Howard
Kathleen C. Fraser
Anahita Bhiwandiwalla
S. Kiritchenko
148
13
0
30 May 2024