2303.16634
Cited By
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
29 March 2023
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM
ALM
LM&MA
Papers citing
"G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
50 / 757 papers shown
Title
Exploring the Use of LLMs for SQL Equivalence Checking
Rajat Singh
Srikanta J. Bedathur
62
3
0
07 Dec 2024
Explingo: Explaining AI Predictions using Large Language Models
Alexandra Zytek
Sara Pido
Sarah Alnegheimish
Laure Berti-Equille
K. Veeramachaneni
74
1
0
06 Dec 2024
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
Xueqing Wu
Yuheng Ding
Bingxuan Li
Pan Lu
Da Yin
Kai-Wei Chang
Nanyun Peng
LRM
108
3
0
03 Dec 2024
Generating a Low-code Complete Workflow via Task Decomposition and RAG
Orlando Marquez Ayala
Patrice Béchard
65
1
0
29 Nov 2024
Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator
Frederic Kirstein
Terry Ruas
Bela Gipp
92
2
0
27 Nov 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
126
73
0
25 Nov 2024
LLM Augmentations to support Analytical Reasoning over Multiple Documents
Raquib Bin Yousuf
Nicholas Defelice
Mandar Sharma
Shengzhe Xu
Naren Ramakrishnan
66
2
0
25 Nov 2024
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark
Rong-Cheng Tu
Zi-Ao Ma
Tian Lan
Yuehao Zhao
Heyan Huang
Xian-Ling Mao
MLLM
VLM
EGVM
106
4
0
23 Nov 2024
Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems
Hongliu Cao
74
2
0
20 Nov 2024
Neon: News Entity-Interaction Extraction for Enhanced Question Answering
Sneha Singhania
Silviu Cucerzan
Allen Herring
S. Jauhar
KELM
74
0
0
19 Nov 2024
Structured Dialogue System for Mental Health: An LLM Chatbot Leveraging the PM+ Guidelines
Yixiang Chen
Xinyu Zhang
Jinran Wang
Xurong Xie
Nan Yan
Hui Chen
Lan Wang
AI4MH
45
3
0
16 Nov 2024
Towards Optimizing a Retrieval Augmented Generation using Large Language Model on Academic Data
Anum Afzal
Juraj Vladika
Gentrit Fazlija
Andrei Staradubets
Florian Matthes
RALM
38
0
0
13 Nov 2024
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Bo Yang
Qingping Yang
Runtao Liu
LRM
ReLM
ELM
AIMat
70
1
0
11 Nov 2024
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Chaitanya Malaviya
Joseph Chee Chang
Dan Roth
Mohit Iyyer
Mark Yatskar
Kyle Lo
ELM
48
4
0
11 Nov 2024
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia
Liying Cheng
Hou Pong Chan
Chaoqun Liu
Maojia Song
Sharifah Mahani Aljunied
Soujanya Poria
Lidong Bing
RALM
VLM
48
4
0
09 Nov 2024
FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents
Yilun Zhao
Yitao Long
Yuru Jiang
Chengye Wang
Weiyuan Chen
Hongjun Liu
Yiming Zhang
Xiangru Tang
Chen Zhao
Arman Cohan
VLM
35
1
0
08 Nov 2024
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao
G. Xu
Zhe Wang
Arman Cohan
38
6
0
07 Nov 2024
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Youssef Mohamed
Runjia Li
Ibrahim Said Ahmad
Kilichbek Haydarov
Philip Torr
Kenneth Church
Mohamed Elhoseiny
VLM
38
7
0
06 Nov 2024
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Guan-Ting Lin
Prashanth Gurunath Shivakumar
Aditya Gourav
Yile Gu
Ankur Gandhe
Hung-yi Lee
I. Bulyko
34
8
0
04 Nov 2024
Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada
Claire Stevenson
Lonneke van der Plas
LM&MA
LRM
38
3
0
04 Nov 2024
Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups
Razvan-Alexandru Smadu
David-Gabriel Ion
Dumitru-Clementin Cercel
Florin-Catalin Pop
Mihaela-Claudia Cercel
39
1
0
03 Nov 2024
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM
LRM
69
0
0
03 Nov 2024
Active Preference-based Learning for Multi-dimensional Personalization
Minhyeon Oh
Seungjoon Lee
Jungseul Ok
31
1
0
01 Nov 2024
IdeaBench: Benchmarking Large Language Models for Research Idea Generation
Sikun Guo
Amir Hassan Shariatmadari
Guangzhi Xiong
Albert Huang
Eric Xie
Stefan Bekiranov
Aidong Zhang
LM&MA
40
8
0
31 Oct 2024
Responsible Retrieval Augmented Generation for Climate Decision Making from Documents
Matyas Juhasz
Kalyan Dutia
Henry Franks
Conor Delahunty
Patrick Fawbert Mills
Harrison Pim
34
1
0
31 Oct 2024
On Positional Bias of Faithfulness for Long-form Summarization
David Wan
Jesse Vig
Joey Tianyi Zhou
Chenyu You
HILM
60
4
0
31 Oct 2024
Prove Your Point!: Bringing Proof-Enhancement Principles to Argumentative Essay Generation
Ruiyu Xiao
Lei Wu
Yuhang Gou
Weinan Zhang
Ting Liu
36
0
0
30 Oct 2024
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu
Bowen Shi
Avi Caciularu
Idan Szpektor
Arman Cohan
72
4
0
30 Oct 2024
CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs
Zhihao Liu
Chenhui Hu
ALM
ELM
56
1
0
29 Oct 2024
AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline
Dongkyu Kim
Byoungwook Kim
Donggeon Han
Matouš Eibich
40
8
0
28 Oct 2024
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
Yen-Shan Chen
Jing Jin
Peng-Ting Kuo
Chao-Wei Huang
Yun-Nung (Vivian) Chen
30
1
0
28 Oct 2024
SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script
Eunwon Kim
Chanho Park
Buru Chang
24
1
0
28 Oct 2024
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
Dongryeol Lee
Yerin Hwang
Yongil Kim
Joonsuk Park
Kyomin Jung
ELM
78
5
0
28 Oct 2024
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation
Jaechang Kim
Jinmin Goh
Inseok Hwang
Jaewoong Cho
Jungseul Ok
ELM
33
1
0
28 Oct 2024
AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
Clemencia Siro
Yifei Yuan
Mohammad Aliannejadi
Maarten de Rijke
ELM
25
3
0
25 Oct 2024
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
Xiaoqiang Wang
Bang Liu
LLMAG
LM&Ro
LRM
54
6
0
24 Oct 2024
Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Margaret Kroll
Kelsey Kraus
24
2
0
23 Oct 2024
Large Language Models Still Exhibit Bias in Long Text
Wonje Jeung
Dongjae Jeon
Ashkan Yousefpour
Jonghyun Choi
ALM
31
4
0
23 Oct 2024
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
Taewhoo Lee
Chanwoong Yoon
Kyochul Jang
Donghyeon Lee
Minju Song
Hyunjae Kim
Jaewoo Kang
ELM
35
1
0
22 Oct 2024
How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?
Kenza Benkirane
Jackie Kay
Maria Perez-Ortiz
33
0
0
21 Oct 2024
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Xiongtao Zhou
Jie He
Lanyu Chen
Jingyu Li
Haojing Chen
Víctor Gutiérrez-Basulto
Jeff Z. Pan
H. Chen
LRM
63
1
0
18 Oct 2024
Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Hamed Fayyaz
Raphael Poulain
Rahmatollah Beheshti
40
1
0
18 Oct 2024
IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection
Jielin Song
Siyu Liu
Bin Zhu
Yanghui Rao
38
2
0
17 Oct 2024
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Dilip Venkatesh
Raj Dabre
Anoop Kunchukuttan
Mitesh M. Khapra
ELM
40
1
0
17 Oct 2024
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
F. S. Bao
Miaoran Li
Renyi Qu
Ge Luo
Erana Wan
...
Ruixuan Tu
Chenyu Xu
Matthew Gonzales
Ofer Mendelevitch
Amin Ahmad
VLM
HILM
28
3
0
17 Oct 2024
Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
Aryan Shrivastava
Jessica Hullman
Max Lamparth
45
6
0
17 Oct 2024
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Yuzhe Yang
Yifei Zhang
Yan Hu
Y. Guo
Ruoli Gan
...
Haining Wang
Qianqian Xie
Jimin Huang
Honghai Yu
Benyou Wang
ELM
AIFin
42
2
0
17 Oct 2024
From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization
Catarina G. Belem
Pouya Pezeshkpour
Hayate Iso
Seiji Maekawa
Nikita Bhutani
Estevam R. Hruschka
HILM
75
2
0
17 Oct 2024
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Nandan Thakur
Suleman Kazi
Ge Luo
Jimmy J. Lin
Amin Ahmad
VLM
RALM
28
7
0
17 Oct 2024
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
Xiaonan Jing
Srinivas Billa
Danny Godbout
HILM
45
0
0
16 Oct 2024