ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.16634
  4. Cited By
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
    ELM
    ALM
    LM&MA
ArXivPDFHTML

Papers citing "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

50 / 757 papers shown
Title
Enhancing Abstractive Summarization of Scientific Papers Using Structure Information
Enhancing Abstractive Summarization of Scientific Papers Using Structure Information
Tong Bao
Heng Zhang
Chengzhi Zhang
2
0
0
20 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
7
0
0
20 May 2025
R3: Robust Rubric-Agnostic Reward Models
R3: Robust Rubric-Agnostic Reward Models
David Anugraha
Zilu Tang
Lester James V. Miranda
Hanyang Zhao
Mohammad Rifqi Farhansyah
Garry Kuwanto
Derry Wijaya
Genta Indra Winata
9
0
0
19 May 2025
Enriching Patent Claim Generation with European Patent Dataset
Enriching Patent Claim Generation with European Patent Dataset
Lekang Jiang
Chengzu Li
Stephan Goetz
7
0
0
18 May 2025
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Luyu Chen
Zeyu Zhang
Haoran Tan
Quanyu Dai
Hao-ran Yang
Zhenhua Dong
Xu Chen
4
0
0
18 May 2025
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization
Weixiao Zhou
Junnan Zhu
Gengyao Li
Xianfu Cheng
Xinnian Liang
Feifei Zhai
Zhiyu Li
ALM
7
0
0
18 May 2025
From Recall to Reasoning: Automated Question Generation for Deeper Math Learning through Large Language Models
From Recall to Reasoning: Automated Question Generation for Deeper Math Learning through Large Language Models
Yongan Yu
Alexandre Krantz
Nikki G. Lobczowski
LRM
7
0
0
17 May 2025
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
X. Zhang
Zetian Ouyang
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Ya Zhang
Yanfeng Wang
Liang He
LM&MA
ELM
18
0
0
17 May 2025
Towards Better Evaluation for Generated Patent Claims
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
30
0
0
16 May 2025
Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects
Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects
Tobias Preintner
Weixuan Yuan
Qi Huang
Adrian König
Thomas Bäck
E. Raponi
Niki van Stein
34
0
0
09 May 2025
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
Sherzod Hakimov
David Schlangen
LLMAG
54
0
0
08 May 2025
SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation
SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation
Tanguy Herserant
Vincent Guigue
ELM
45
0
0
04 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
Yiming Li
Shangqing Tu
...
Zhiyuan Liu
Huiqin Liu
Lei Hou
Juanzi Li
Bin Xu
26
0
0
04 May 2025
LookAlike: Consistent Distractor Generation in Math MCQs
LookAlike: Consistent Distractor Generation in Math MCQs
Nisarg Parikh
Nigel Fernandez
Alexander Scarlatos
Simon Woodhead
Andrew S. Lan
53
0
0
03 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
H. Wang
Yuxiao Chen
Qingyun Wu
49
1
0
30 Apr 2025
JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
Anum Afzal
Alexandre Mercier
Florian Matthes
65
0
0
29 Apr 2025
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Hanhua Hong
Chenghao Xiao
Yang Wang
Y. Liu
Wenge Rong
Chenghua Lin
31
0
0
29 Apr 2025
A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces
A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces
Juliana Barbosa
Ulhas Gondhali
Gohar Petrossian
Kinshuk Sharma
Sunandan Chakraborty
Jennifer Jacquet
Juliana Freire
31
0
0
29 Apr 2025
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Mihai Nadas
Laura Diosan
Andrei Piscoran
Andreea Tomescu
VGen
59
0
0
29 Apr 2025
Automatic Legal Writing Evaluation of LLMs
Automatic Legal Writing Evaluation of LLMs
Ramon Pires
Roseval Malaquias Junior
Rodrigo Nogueira
AILaw
ELM
83
0
0
29 Apr 2025
Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge
Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge
Wenhan Mu
Ling Xu
Shuren Pei
Le Mi
Huichi Zhou
AAML
ELM
53
0
0
28 Apr 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
50
0
0
27 Apr 2025
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
Jiabin Fan
Guoqing Luo
Michael Bowling
Lili Mou
OffRL
68
0
0
26 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
91
2
0
26 Apr 2025
An Empirical Study of Evaluating Long-form Question Answering
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
37
0
0
25 Apr 2025
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan
Dmitry Namiot
SILM
AAML
ELM
85
0
0
25 Apr 2025
A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Yangxinyu Xie
Bowen Jiang
Tanwi Mallick
Joshua Bergerson
John K Hutchison
...
Robert B. Ross
Yan Feng
L. Levy
Weijie J. Su
Camillo J Taylor
37
1
0
24 Apr 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo
Jinheon Baek
Seongyun Lee
Sung Ju Hwang
AI4CE
44
0
0
24 Apr 2025
Process Reward Models That Think
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
44
1
0
23 Apr 2025
Med-CoDE: Medical Critique based Disagreement Evaluation Framework
Med-CoDE: Medical Critique based Disagreement Evaluation Framework
Mohit Gupta
Akiko Aizawa
R. Shah
LM&MA
ELM
32
0
0
21 Apr 2025
Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval
Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval
Yong-En Tian
Yu-Chien Tang
Kuang-Da Wang
An-Zi Yen
Wen-Chih Peng
AIFin
49
0
0
19 Apr 2025
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili
Beatrice Savoldi
Matteo Negri
L. Bentivogli
ELM
37
0
0
16 Apr 2025
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
Minqian Liu
Zhiyang Xu
Xinyi Zhang
Heajun An
Sarvech Qadir
...
Pamela J. Wisniewski
Jin-Hee Cho
Sang Won Lee
Ruoxi Jia
Lifu Huang
29
1
0
14 Apr 2025
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
Dayu Yang
Antoine Simoulin
Xin Qian
Xiaoyi Liu
Yuwei Cao
Zhaopu Teng
Grey Yang
LLMAG
59
0
0
11 Apr 2025
LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
Jun Wang
Zhengyuan Zhu
Zeyu Zhang
Chengkai Li
27
0
0
11 Apr 2025
Large Language Models as Span Annotators
Large Language Models as Span Annotators
Zdeněk Kasner
Vilém Zouhar
Patrícia Schmidtová
Ivan Kartáč
Kristýna Onderková
Ondřej Plátek
Dimitra Gkatzia
Saad Mahamood
Ondrej Dusek
Simone Balloccu
ALM
40
0
0
11 Apr 2025
From Speech to Summary: A Comprehensive Survey of Speech Summarization
From Speech to Summary: A Comprehensive Survey of Speech Summarization
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander Waibel
46
0
0
10 Apr 2025
DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Daniil Larionov
Sotaro Takeshita
Ran Zhang
Yanran Chen
Christoph Leiter
Zhipin Wang
Christian Greisinger
Steffen Eger
ReLM
ELM
LRM
74
1
0
10 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
45
0
0
10 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li
Hanchen Li
Chenhao Tan
ALM
ELM
49
0
0
09 Apr 2025
ARLO: A Tailorable Approach for Transforming Natural Language Software Requirements into Architecture using LLMs
ARLO: A Tailorable Approach for Transforming Natural Language Software Requirements into Architecture using LLMs
Tooraj Helmi
35
0
0
08 Apr 2025
FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking
FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking
Islam Eldifrawi
Shengrui Wang
Amine Trabelsi
29
0
0
07 Apr 2025
Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching
Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching
John Paisley
Wei Zhang
Brian Barr
46
0
0
04 Apr 2025
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Xiang Wang
Daniil Larionov
Siwei Wu
Yiqi Liu
Steffen Eger
N. Moosavi
Chenghua Lin
ALM
52
1
0
02 Apr 2025
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José P. Pombal
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
ALM
75
0
0
01 Apr 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELM
LRM
66
1
0
01 Apr 2025
Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models
Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models
Rafael Giebisch
Ken E. Friedl
Lev Sorokin
Andrea Stocco
HILM
55
0
0
01 Apr 2025
XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation
XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation
Vivek Iyer
Ricardo Rei
Pinzhen Chen
Alexandra Birch
SyDa
LM&MA
73
0
0
29 Mar 2025
MemInsight: Autonomous Memory Augmentation for LLM Agents
MemInsight: Autonomous Memory Augmentation for LLM Agents
Rana Salama
Jason (Jinglun) Cai
Michelle Yuan
Anna Currey
Monica Sunkara
Yi Zhang
Yassine Benajiba
LLMAG
RALM
89
1
0
27 Mar 2025
JEEM: Vision-Language Understanding in Four Arabic Dialects
JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui
Hanin Atwany
Hamdan Al-Ali
Abdelrahman Mohamed
Ali Mekky
Sergei Tilga
Natalia Fedorova
Ekaterina Artemova
Hanan Aldarmaki
Yova Kementchedjhieva
VLM
51
1
0
27 Mar 2025
1234...141516
Next