ResearchTrend.AI
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment


29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
    ELM
    ALM
    LM&MA

Papers citing "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

50 / 765 papers shown
Title
FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering
Tianchi Cai
Zhiwen Tan
Xierui Song
Tao Sun
Jiyan Jiang
Yunqi Xu
Yinger Zhang
Jinjie Gu
32
5
0
19 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
46
11
0
19 Jun 2024
Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models
Akchay Srivastava
Atif Memon
ELM
48
1
0
19 Jun 2024
Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Ruiqi He
Yushu He
Longju Bai
Jiarui Liu
Zhenjie Sun
Zenghao Tang
He Wang
Hanchen Xia
Naihao Deng
38
0
0
18 Jun 2024
Measuring Psychological Depth in Language Models
Fabrice Harel-Canada
Hanyu Zhou
Sreya Mupalla
Zeynep Yildiz
Amit Sahai
Nanyun Peng
45
3
0
18 Jun 2024
LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization
Masafumi Enomoto
Kunihiro Takeoka
Kosuke Akimoto
Kiril Gashteovski
Masafumi Oyamada
RALM
32
1
0
18 Jun 2024
AI-Assisted Human Evaluation of Machine Translation
Vilém Zouhar
Tom Kocmi
Mrinmaya Sachan
51
5
0
18 Jun 2024
Unveiling Implicit Table Knowledge with Question-Then-Pinpoint Reasoner for Insightful Table Summarization
Kwangwook Seo
Jinyoung Yeo
Dongha Lee
ReLM
LMTD
LRM
31
1
0
18 Jun 2024
A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4
Ming Gu
Yan Yang
26
0
0
17 Jun 2024
Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Han Zhou
Xingchen Wan
Yinhong Liu
Nigel Collier
Ivan Vulić
Anna Korhonen
ALM
44
9
0
17 Jun 2024
ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark
Hiromi Wakaki
Yuki Mitsufuji
Yoshinori Maeda
Yukiko Nishimura
Silin Gao
Mengjie Zhao
Keiichi Yamada
Antoine Bosselut
52
0
0
17 Jun 2024
Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models
Scott Barnett
Zac Brannelly
Stefanus Kurniawan
Sheng Wong
LRM
35
1
0
17 Jun 2024
Aligning Large Language Models from Self-Reference AI Feedback with one General Principle
Rong Bao
Rui Zheng
Shihan Dou
Xiao Wang
Enyu Zhou
Bo Wang
Qi Zhang
Liang Ding
Dacheng Tao
ALM
55
0
0
17 Jun 2024
Incentivizing Quality Text Generation via Statistical Contracts
Eden Saig
Ohad Einav
Inbal Talgam-Cohen
33
2
0
17 Jun 2024
Can LLMs Understand the Implication of Emphasized Sentences in Dialogue?
Guan-Ting Lin
Hung-yi Lee
48
3
0
16 Jun 2024
Towards Lifelong Dialogue Agents via Timeline-based Memory Management
Kai Tzu-iunn Ong
Taeyoon Kwon
Namyoung Kim
Keummin Ka
SeongHyeon Bae
Yohan Jo
Dongha Lee
KELM
RALM
40
2
0
16 Jun 2024
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Tu Anh Dinh
Carlos Mullov
Leonard Barmann
Zhaolin Li
Danni Liu
...
Michael Beigl
Rainer Stiefelhagen
Carsten Dachsbacher
Klemens Bohm
Jan Niehues
ELM
45
8
0
14 Jun 2024
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Olivier Binette
Jerome P. Reiter
38
0
0
14 Jun 2024
A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
58
9
0
14 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGen
LRM
51
12
0
12 Jun 2024
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang
Yingzhuo Yu
Jin Huang
Xingjian Zhang
Jiaqi Ma
36
1
0
11 Jun 2024
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
Zonghao Ying
Aishan Liu
Xianglong Liu
Dacheng Tao
62
18
0
10 Jun 2024
FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model
Yebin Lee
Imseong Park
Myungjoo Kang
40
11
0
10 Jun 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM
ALM
LM&MA
110
34
0
09 Jun 2024
Think out Loud: Emotion Deducing Explanation in Dialogues
JiangNan Li
Zheng Lin
Lanrui Wang
Q. Si
Yanan Cao
Mo Yu
Peng Fu
Weiping Wang
Jie Zhou
44
0
0
07 Jun 2024
Key-Element-Informed sLLM Tuning for Document Summarization
Sangwon Ryu
Heejin Do
Yunsu Kim
G. G. Lee
Jungseul Ok
37
6
0
07 Jun 2024
Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation
Fanyou Wu
Weijie Xu
Chandan K. Reddy
Srinivasan H. Sengamedu
31
0
0
06 Jun 2024
Large Language Models as Evaluators for Recommendation Explanations
Xiaoyu Zhang
Yishan Li
Jiayin Wang
Bowen Sun
Weizhi Ma
Peijie Sun
Min Zhang
LRM
ELM
50
12
0
05 Jun 2024
LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation
Yi-Pei Chen
Kuanchao Chu
Hideki Nakayama
LRM
29
1
0
05 Jun 2024
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs
Ryo Kamoi
Yusen Zhang
Nan Zhang
Jiawei Han
Rui Zhang
LRM
50
61
0
03 Jun 2024
Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach
Sambaran Bandyopadhyay
Himanshu Maheshwari
Anandhavelu Natarajan
Apoorv Saxena
44
5
0
01 Jun 2024
Multi-Dimensional Optimization for Text Summarization via Reinforcement Learning
Sangwon Ryu
Heejin Do
Yunsu Kim
Gary Geunbae Lee
Jungseul Ok
34
3
0
01 Jun 2024
Amortizing intractable inference in diffusion models for vision, language, and control
S. Venkatraman
Moksh Jain
Luca Scimeca
Minsu Kim
Marcin Sendera
...
Alexandre Adam
Jarrid Rector-Brooks
Yoshua Bengio
Glen Berseth
Nikolay Malkin
70
26
0
31 May 2024
FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores
Alyssa Huang
Oishi Banerjee
Kay Wu
Eduardo Pontes Reis
Pranav Rajpurkar
MedIm
LM&MA
45
6
0
31 May 2024
Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
Ruochen Zhao
Wenxuan Zhang
Yew Ken Chia
Deli Zhao
Lidong Bing
41
10
0
30 May 2024
X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions
Chong Li
Wen Yang
Jiajun Zhang
Jinliang Lu
Shaonan Wang
Chengqing Zong
49
6
0
30 May 2024
Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals
Phillip Howard
Kathleen C. Fraser
Anahita Bhiwandiwalla
S. Kiritchenko
57
9
0
30 May 2024
Unlearning Climate Misinformation in Large Language Models
Michael Fore
Simranjit Singh
Chaehong Lee
Amritanshu Pandey
Antonios Anastasopoulos
Dimitrios Stamoulis
MU
62
1
0
29 May 2024
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Zhe Hu
Tuo Liang
Jing Li
Yiren Lu
Yunlai Zhou
Yiran Qiao
Jing Ma
Yu Yin
55
4
0
29 May 2024
CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
Chenhao Zhang
Renhao Li
Minghuan Tan
Min Yang
Jingwei Zhu
Di Yang
Jiahao Zhao
Guancheng Ye
Chengming Li
Xiping Hu
60
21
0
26 May 2024
Towards Completeness-Oriented Tool Retrieval for Large Language Models
Changle Qu
Sunhao Dai
Xiaochi Wei
Hengyi Cai
Shuaiqiang Wang
Dawei Yin
Jun Xu
Jirong Wen
KELM
33
7
0
25 May 2024
SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation
Kun Zhao
Bohao Yang
Chen Tang
Chenghua Lin
Liang Zhan
49
5
0
24 May 2024
CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems
Abbas Ghaddar
David Alfonso-Hermelo
Philippe Langlais
Mehdi Rezagholizadeh
Boxing Chen
Prasanna Parthasarathi
44
0
0
24 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
138
54
3
23 May 2024
Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models
Yiming Chen
Chen Zhang
Danqing Luo
L. F. D’Haro
R. Tan
Haizhou Li
AAML
ELM
50
2
0
23 May 2024
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles
Sarah Clinckemaillie
Yifan Chang
Jonathan Waltz
Gabrielle Lau
...
Daniel Toyama
Robert Berry
Divya Tyamagundlu
Timothy Lillicrap
Oriana Riva
LLMAG
72
44
0
23 May 2024
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Zhangyue Yin
Qiushi Sun
Qipeng Guo
Zhiyuan Zeng
Xiaonan Li
...
Qinyuan Cheng
Ding Wang
Xiaofeng Mou
Xipeng Qiu
XuanJing Huang
LRM
51
4
0
21 May 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim
Gary Geunbae Lee
AAML
43
3
0
21 May 2024
Presentations are not always linear! GNN meets LLM for Document-to-Presentation Transformation with Attribution
Himanshu Maheshwari
Sambaran Bandyopadhyay
Aparna Garimella
Anandhavelu Natarajan
16
4
0
21 May 2024
OLAPH: Improving Factuality in Biomedical Long-form Question Answering
Minbyul Jeong
Hyeon Hwang
Chanwoong Yoon
Taewhoo Lee
Jaewoo Kang
MedIm
HILM
LM&MA
53
12
0
21 May 2024