2303.16634
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM
ALM
LM&MA
Papers citing
"G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
50 / 757 papers shown
Enhancing Abstractive Summarization of Scientific Papers Using Structure Information
Tong Bao
Heng Zhang
Chengzhi Zhang
2
0
0
20 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
7
0
0
20 May 2025
R3: Robust Rubric-Agnostic Reward Models
David Anugraha
Zilu Tang
Lester James V. Miranda
Hanyang Zhao
Mohammad Rifqi Farhansyah
Garry Kuwanto
Derry Wijaya
Genta Indra Winata
9
0
0
19 May 2025
Enriching Patent Claim Generation with European Patent Dataset
Lekang Jiang
Chengzu Li
Stephan Goetz
7
0
0
18 May 2025
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Luyu Chen
Zeyu Zhang
Haoran Tan
Quanyu Dai
Hao-ran Yang
Zhenhua Dong
Xu Chen
4
0
0
18 May 2025
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization
Weixiao Zhou
Junnan Zhu
Gengyao Li
Xianfu Cheng
Xinnian Liang
Feifei Zhai
Zhiyu Li
ALM
7
0
0
18 May 2025
From Recall to Reasoning: Automated Question Generation for Deeper Math Learning through Large Language Models
Yongan Yu
Alexandre Krantz
Nikki G. Lobczowski
LRM
7
0
0
17 May 2025
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
X. Zhang
Zetian Ouyang
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Ya Zhang
Yanfeng Wang
Liang He
LM&MA
ELM
18
0
0
17 May 2025
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
30
0
0
16 May 2025
Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects
Tobias Preintner
Weixuan Yuan
Qi Huang
Adrian König
Thomas Bäck
E. Raponi
Niki van Stein
34
0
0
09 May 2025
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
Sherzod Hakimov
David Schlangen
LLMAG
54
0
0
08 May 2025
SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation
Tanguy Herserant
Vincent Guigue
ELM
45
0
0
04 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
Yiming Li
Shangqing Tu
...
Zhiyuan Liu
Huiqin Liu
Lei Hou
Juanzi Li
Bin Xu
26
0
0
04 May 2025
LookAlike: Consistent Distractor Generation in Math MCQs
Nisarg Parikh
Nigel Fernandez
Alexander Scarlatos
Simon Woodhead
Andrew S. Lan
53
0
0
03 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
H. Wang
Yuxiao Chen
Qingyun Wu
49
1
0
30 Apr 2025
JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
Anum Afzal
Alexandre Mercier
Florian Matthes
65
0
0
29 Apr 2025
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Hanhua Hong
Chenghao Xiao
Yang Wang
Y. Liu
Wenge Rong
Chenghua Lin
31
0
0
29 Apr 2025
A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces
Juliana Barbosa
Ulhas Gondhali
Gohar Petrossian
Kinshuk Sharma
Sunandan Chakraborty
Jennifer Jacquet
Juliana Freire
31
0
0
29 Apr 2025
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Mihai Nadas
Laura Diosan
Andrei Piscoran
Andreea Tomescu
VGen
59
0
0
29 Apr 2025
Automatic Legal Writing Evaluation of LLMs
Ramon Pires
Roseval Malaquias Junior
Rodrigo Nogueira
AILaw
ELM
83
0
0
29 Apr 2025
Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge
Wenhan Mu
Ling Xu
Shuren Pei
Le Mi
Huichi Zhou
AAML
ELM
53
0
0
28 Apr 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
50
0
0
27 Apr 2025
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
Jiabin Fan
Guoqing Luo
Michael Bowling
Lili Mou
OffRL
68
0
0
26 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
91
2
0
26 Apr 2025
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
37
0
0
25 Apr 2025
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan
Dmitry Namiot
SILM
AAML
ELM
85
0
0
25 Apr 2025
A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Yangxinyu Xie
Bowen Jiang
Tanwi Mallick
Joshua Bergerson
John K Hutchison
...
Robert B. Ross
Yan Feng
L. Levy
Weijie J. Su
Camillo J Taylor
37
1
0
24 Apr 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo
Jinheon Baek
Seongyun Lee
Sung Ju Hwang
AI4CE
44
0
0
24 Apr 2025
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
44
1
0
23 Apr 2025
Med-CoDE: Medical Critique based Disagreement Evaluation Framework
Mohit Gupta
Akiko Aizawa
R. Shah
LM&MA
ELM
32
0
0
21 Apr 2025
Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval
Yong-En Tian
Yu-Chien Tang
Kuang-Da Wang
An-Zi Yen
Wen-Chih Peng
AIFin
49
0
0
19 Apr 2025
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili
Beatrice Savoldi
Matteo Negri
L. Bentivogli
ELM
37
0
0
16 Apr 2025
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
Minqian Liu
Zhiyang Xu
Xinyi Zhang
Heajun An
Sarvech Qadir
...
Pamela J. Wisniewski
Jin-Hee Cho
Sang Won Lee
Ruoxi Jia
Lifu Huang
29
1
0
14 Apr 2025
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
Dayu Yang
Antoine Simoulin
Xin Qian
Xiaoyi Liu
Yuwei Cao
Zhaopu Teng
Grey Yang
LLMAG
59
0
0
11 Apr 2025
LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
Jun Wang
Zhengyuan Zhu
Zeyu Zhang
Chengkai Li
27
0
0
11 Apr 2025
Large Language Models as Span Annotators
Zdeněk Kasner
Vilém Zouhar
Patrícia Schmidtová
Ivan Kartáč
Kristýna Onderková
Ondřej Plátek
Dimitra Gkatzia
Saad Mahamood
Ondrej Dusek
Simone Balloccu
ALM
40
0
0
11 Apr 2025
From Speech to Summary: A Comprehensive Survey of Speech Summarization
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander Waibel
46
0
0
10 Apr 2025
DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Daniil Larionov
Sotaro Takeshita
Ran Zhang
Yanran Chen
Christoph Leiter
Zhipin Wang
Christian Greisinger
Steffen Eger
ReLM
ELM
LRM
74
1
0
10 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
45
0
0
10 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li
Hanchen Li
Chenhao Tan
ALM
ELM
49
0
0
09 Apr 2025
ARLO: A Tailorable Approach for Transforming Natural Language Software Requirements into Architecture using LLMs
Tooraj Helmi
35
0
0
08 Apr 2025
FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking
Islam Eldifrawi
Shengrui Wang
Amine Trabelsi
29
0
0
07 Apr 2025
Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching
John Paisley
Wei Zhang
Brian Barr
46
0
0
04 Apr 2025
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Xiang Wang
Daniil Larionov
Siwei Wu
Yiqi Liu
Steffen Eger
N. Moosavi
Chenghua Lin
ALM
52
1
0
02 Apr 2025
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José P. Pombal
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
ALM
75
0
0
01 Apr 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELM
LRM
66
1
0
01 Apr 2025
Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models
Rafael Giebisch
Ken E. Friedl
Lev Sorokin
Andrea Stocco
HILM
55
0
0
01 Apr 2025
XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation
Vivek Iyer
Ricardo Rei
Pinzhen Chen
Alexandra Birch
SyDa
LM&MA
73
0
0
29 Mar 2025
MemInsight: Autonomous Memory Augmentation for LLM Agents
Rana Salama
Jason (Jinglun) Cai
Michelle Yuan
Anna Currey
Monica Sunkara
Yi Zhang
Yassine Benajiba
LLMAG
RALM
89
1
0
27 Mar 2025
JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui
Hanin Atwany
Hamdan Al-Ali
Abdelrahman Mohamed
Ali Mekky
Sergei Tilga
Natalia Fedorova
Ekaterina Artemova
Hanan Aldarmaki
Yova Kementchedjhieva
VLM
51
1
0
27 Mar 2025