2303.16634
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM
ALM
LM&MA
Papers citing
"G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
50 / 747 papers shown
Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects
Tobias Preintner
Weixuan Yuan
Qi Huang
Adrian König
Thomas Bäck
E. Raponi
N. V. Stein
26
0
0
09 May 2025
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
Sherzod Hakimov
David Schlangen
LLMAG
40
0
0
08 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
H. Li
Shangqing Tu
...
Zhiyuan Liu
Huiqin Liu
Lei Hou
Juanzi Li
Bin Xu
24
0
0
04 May 2025
SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation
Tanguy Herserant
Vincent Guigue
ELM
40
0
0
04 May 2025
LookAlike: Consistent Distractor Generation in Math MCQs
Nisarg Parikh
Nigel Fernandez
Alexander Scarlatos
Simon Woodhead
Andrew S. Lan
48
0
0
03 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
J. H. Liu
Zhiguang Han
...
Beibin Li
Chi Wang
H. Wang
Y. Chen
Qingyun Wu
49
0
0
30 Apr 2025
A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces
Juliana Barbosa
Ulhas Gondhali
Gohar Petrossian
Kinshuk Sharma
Sunandan Chakraborty
Jennifer Jacquet
Juliana Freire
31
0
0
29 Apr 2025
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Mihai Nadas
Laura Diosan
Andrei Piscoran
Andreea Tomescu
VGen
57
0
0
29 Apr 2025
Automatic Legal Writing Evaluation of LLMs
Ramon Pires
Roseval Malaquias Junior
Rodrigo Nogueira
AILaw
ELM
81
0
0
29 Apr 2025
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Hanhua Hong
Chenghao Xiao
Yang Wang
Y. Liu
Wenge Rong
Chenghua Lin
26
0
0
29 Apr 2025
JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
Anum Afzal
Alexandre Mercier
Florian Matthes
60
0
0
29 Apr 2025
Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge
Wenhan Mu
Ling Xu
Shuren Pei
Le Mi
Huichi Zhou
AAML
ELM
53
0
0
28 Apr 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
43
0
0
27 Apr 2025
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
Jiabin Fan
Guoqing Luo
Michael Bowling
Lili Mou
OffRL
63
0
0
26 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
1
0
26 Apr 2025
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
32
0
0
25 Apr 2025
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan
Dmitry Namiot
SILM
AAML
ELM
77
0
0
25 Apr 2025
A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Yangxinyu Xie
Bowen Jiang
Tanwi Mallick
Joshua Bergerson
John K Hutchison
...
Robert B. Ross
Yan Feng
L. Levy
Weijie J. Su
Camillo J. Taylor
32
0
0
24 Apr 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo
Jinheon Baek
Seongyun Lee
S. Hwang
AI4CE
39
0
0
24 Apr 2025
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
44
1
0
23 Apr 2025
Med-CoDE: Medical Critique based Disagreement Evaluation Framework
Mohit Gupta
Akiko Aizawa
R. Shah
LM&MA
ELM
30
0
0
21 Apr 2025
Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval
Yong-En Tian
Yu-Chien Tang
Kuang-Da Wang
An-Zi Yen
Wen-Chih Peng
AIFin
44
0
0
19 Apr 2025
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili
Beatrice Savoldi
Matteo Negri
L. Bentivogli
ELM
35
0
0
16 Apr 2025
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
Minqian Liu
Zhiyang Xu
Xinyi Zhang
Heajun An
Sarvech Qadir
...
Pamela J. Wisniewski
Jin-Hee Cho
Sang Won Lee
Ruoxi Jia
Lifu Huang
29
1
0
14 Apr 2025
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
Dayu Yang
Antoine Simoulin
Xin Qian
Xiaoyi Liu
Yuwei Cao
Zhaopu Teng
Grey Yang
LLMAG
56
0
0
11 Apr 2025
LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
H. Zhang
Zhengyuan Zhu
Zeyu Zhang
Chengkai Li
22
0
0
11 Apr 2025
Large Language Models as Span Annotators
Zdeněk Kasner
Vilém Zouhar
Patrícia Schmidtová
Ivan Kartáč
Kristýna Onderková
Ondřej Plátek
Dimitra Gkatzia
Saad Mahamood
Ondrej Dusek
Simone Balloccu
ALM
35
0
0
11 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
45
0
0
10 Apr 2025
From Speech to Summary: A Comprehensive Survey of Speech Summarization
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander Waibel
39
0
0
10 Apr 2025
DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Daniil Larionov
Sotaro Takeshita
Ran Zhang
Yanran Chen
Christoph Leiter
Zhipin Wang
Christian Greisinger
Steffen Eger
ReLM
ELM
LRM
72
0
0
10 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li
Hanchen Li
Chenhao Tan
ALM
ELM
42
0
0
09 Apr 2025
ARLO: A Tailorable Approach for Transforming Natural Language Software Requirements into Architecture using LLMs
Tooraj Helmi
23
0
0
08 Apr 2025
FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking
Islam Eldifrawi
Shengrui Wang
Amine Trabelsi
29
0
0
07 Apr 2025
Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching
John Paisley
Wei Zhang
Brian Barr
38
0
0
04 Apr 2025
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
X. Wang
Daniil Larionov
Siwei Wu
Yiqi Liu
Steffen Eger
N. Moosavi
Chenghua Lin
ALM
50
0
0
02 Apr 2025
Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models
Rafael Giebisch
Ken E. Friedl
Lev Sorokin
Andrea Stocco
HILM
50
0
0
01 Apr 2025
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José P. Pombal
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
ALM
70
0
0
01 Apr 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELM
LRM
52
1
0
01 Apr 2025
XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation
Vivek Iyer
Ricardo Rei
Pinzhen Chen
Alexandra Birch
SyDa
LM&MA
70
0
0
29 Mar 2025
MemInsight: Autonomous Memory Augmentation for LLM Agents
Rana Salama
Jason (Jinglun) Cai
Michelle Yuan
Anna Currey
Monica Sunkara
Yi Zhang
Yassine Benajiba
LLMAG
RALM
84
1
0
27 Mar 2025
JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui
Hanin Atwany
Hamdan Al-Ali
Abdelrahman Mohamed
Ali Mekky
Sergei Tilga
Natalia Fedorova
Ekaterina Artemova
Hanan Aldarmaki
Yova Kementchedjhieva
VLM
37
1
0
27 Mar 2025
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Taewon Yun
Jihwan Oh
Hyangsuk Min
Yuho Lee
Jihwan Bang
Jason (Jinglun) Cai
Hwanjun Song
OffRL
LRM
39
0
0
27 Mar 2025
A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications
Sunayana Sitaram
Adrian de Wynter
Isobel McCrum
Qilong Gu
Si-Qing Chen
AILaw
104
0
0
26 Mar 2025
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews
Huimin Xu
Seungjun Yi
Terence Lim
Jiawei Xu
Andrew Well
...
Y. Zhang
Heng Ji
Keshav Pingali
Yan Leng
Ying Ding
LLMAG
86
0
0
26 Mar 2025
DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts
Ling Zhong
Yujing Lu
Jing Yang
Weiming Li
Peng Wei
Yongheng Wang
Manni Duan
Qing Zhang
45
0
0
25 Mar 2025
Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education
Hayate Iso
Pouya Pezeshkpour
Nikita Bhutani
Estevam R. Hruschka
63
0
0
24 Mar 2025
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya
Yinhong Liu
Ramit Debnath
Anna Korhonen
30
0
0
22 Mar 2025
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
Brihi Joshi
Sriram Venkatapathy
Mohit Bansal
Nanyun Peng
Haw-Shiuan Chang
LRM
49
0
0
21 Mar 2025
ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach
Reem Gody
Mahmoud Goudy
Ahmed Tawfik
SyDa
155
0
0
21 Mar 2025
MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
David Wan
Justin Chih-Yao Chen
Elias Stengel-Eskin
Mohit Bansal
LLMAG
LRM
60
1
0
19 Mar 2025