2305.01937
Cited By
Can Large Language Models Be an Alternative to Human Evaluations?
3 May 2023
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
Papers citing
"Can Large Language Models Be an Alternative to Human Evaluations?"
50 / 122 papers shown
Title
Reranking-based Generation for Unbiased Perspective Summarization
Narutatsu Ri
Nicholas Deas
Kathleen McKeown
OffRL
17
0
0
19 Jun 2025
Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Zongxia Li
Yapei Chang
Yuhang Zhou
Xiyang Wu
Zichao Liang
Yoo Yeon Sung
Jordan L. Boyd-Graber
22
0
0
18 Jun 2025
DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
Arie Cattan
Alon Jacovi
Ori Ram
Jonathan Herzig
Roee Aharoni
Sasha Goldshtein
E. Ofek
Idan Szpektor
Avi Caciularu
26
0
0
10 Jun 2025
Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs
Atahan Özer
Çağatay Yıldız
KELM
23
0
0
08 Jun 2025
Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth
Yichi Zhang
Jinlong Pang
Zhaowei Zhu
Yang Liu
21
1
0
08 Jun 2025
Audio-Aware Large Language Models as Judges for Speaking Styles
Cheng-Han Chiang
Xiaofei Wang
Chung-Ching Lin
Kevin Lin
Linjie Li
...
Y. Qian
Zhendong Wang
Zhengyuan Yang
Hung-yi Lee
Lijuan Wang
AuLLM
55
0
0
06 Jun 2025
ProRefine: Inference-time Prompt Refinement with Textual Feedback
Deepak Pandita
Tharindu Cyril Weerasooriya
A. Shah
Christopher Homan
Wei Wei
LLMAG
ReLM
LRM
145
0
0
05 Jun 2025
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
Isik Baran Sandan
Tu Anh Dinh
Jan Niehues
ELM
91
0
0
04 Jun 2025
Quantitative LLM Judges
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
...
Jennifer Healey
Nedim Lipka
Ryan Rossi
Uttaran Bhattacharya
Branislav Kveton
ELM
54
0
0
03 Jun 2025
Labelling Data with Unknown References
Adrian de Wynter
64
0
0
03 Jun 2025
PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements
Petros Raptopoulos
Giorgos Filandrianos
Maria Lymperaiou
Giorgos Stamou
AILaw
42
0
0
31 May 2025
How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG
Qiming Zeng
Xiao Yan
Hao Luo
Yuhao Lin
Yuxiang Wang
Fangcheng Fu
Bo Du
Quanqing Xu
Jiawei Jiang
10
0
0
31 May 2025
OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software
Lingkai Meng
Yu Shao
Long Yuan
Longbin Lai
Peng Cheng
Wenyuan Yu
Wenjie Zhang
Xuemin Lin
Jingren Zhou
12
0
0
29 May 2025
Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
Mingyu Yu
Wei Wang
Y. X. Wei
Sujuan Qin
AAML
40
0
0
29 May 2025
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators
John Mendonça
A. Lavie
Isabel Trancoso
47
0
0
28 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
109
0
0
20 May 2025
Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects
Tobias Preintner
Weixuan Yuan
Qi Huang
Adrian König
Thomas Bäck
Elena Raponi
Niki van Stein
74
0
0
09 May 2025
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin
Niloofar Mireshghallah
Shuyue Stella Li
Michael Duan
Hyunwoo Kim
Yejin Choi
Yulia Tsvetkov
Sewoong Oh
Pang Wei Koh
146
7
0
28 Apr 2025
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
Jiabin Fan
Guoqing Luo
Michael Bowling
Lili Mou
OffRL
140
0
0
26 Apr 2025
CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models
Feiyang Li
Peng Fang
Zhan Shi
Arijit Khan
Fang Wang
Dan Feng
Weihao Wang
Xin Zhang
Yongjian Cui
ReLM
LRM
112
1
0
18 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
139
0
0
16 Apr 2025
Large Language Models as Span Annotators
Zdeněk Kasner
Vilém Zouhar
Patrícia Schmidtová
Ivan Kartáč
Kristýna Onderková
Ondřej Plátek
Dimitra Gkatzia
Saad Mahamood
Ondrej Dusek
Simone Balloccu
ALM
124
0
0
11 Apr 2025
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
Dayu Yang
Antoine Simoulin
Xin Qian
Xiaoyi Liu
Yuwei Cao
Zhaopu Teng
Grey Yang
LLMAG
145
0
0
11 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
101
0
0
10 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li
Hanchen Li
Chenhao Tan
ALM
ELM
126
0
0
09 Apr 2025
Leveraging LLM For Synchronizing Information Across Multilingual Tables
Siddharth Khincha
Tushar Kataria
Ankita Anand
Dan Roth
Vivek Gupta
137
0
0
03 Apr 2025
PaperBench: Evaluating AI's Ability to Replicate AI Research
Giulio Starace
Oliver Jaffe
Dane Sherburn
James Aung
Jun Shern Chan
...
Benjamin Kinsella
Wyatt Thompson
Johannes Heidecke
Amelia Glaese
Tejal Patwardhan
ALM
ELM
967
23
0
02 Apr 2025
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
Aayush Gautam
Susav Shrestha
Narasimha Annapareddy
119
2
0
28 Mar 2025
FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article
Ibrahim Al Azher
Miftahul Jannat Mokarrama
Zhishuai Guo
Sagnik Ray Choudhury
Hamed Alhoori
LLMAG
104
2
0
20 Mar 2025
LLM-Mediated Guidance of MARL Systems
Philipp D. Siedler
Ian Gemp
99
0
0
16 Mar 2025
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Naome A. Etori
Kevin Lu
Randu Karisa
Arturs Kanepajs
LRM
ELM
479
0
0
14 Mar 2025
Is Your Video Language Model a Reliable Judge?
M. Liu
Wensheng Zhang
104
5
0
07 Mar 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
146
1
0
24 Feb 2025
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
Joseph Suh
Erfan Jahanparast
Suhong Moon
Minwoo Kang
Serina Chang
ALM
LM&MA
140
4
0
24 Feb 2025
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation
Zhaopeng Feng
Jiayuan Su
Jiamei Zheng
Jiahan Ren
Yan Zhang
Jian Wu
Hongwei Wang
Zuozhu Liu
ELM
273
1
0
21 Feb 2025
Evaluating Large Language Models for Public Health Classification and Extraction Tasks
Joshua Harris
Timothy Laurence
Leo Loman
Fan Grayson
Toby Nonnenmacher
...
Hamish Mohammed
Thomas Finnie
Luke Hounsome
Michael Borowitz
Steven Riley
LM&MA
AI4MH
148
5
0
20 Feb 2025
Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie
Gray Gwizdz
Dongji Feng
136
0
0
20 Feb 2025
LAMD: Context-driven Android Malware Detection and Classification with LLMs
Xingzhi Qian
Xinran Zheng
Yiling He
Shuo Yang
Lorenzo Cavallaro
152
4
0
18 Feb 2025
Conditioning LLMs to Generate Code-Switched Text
Maite Heredia
Gorka Labaka
Jeremy Barnes
A. Soroa
27
1
0
18 Feb 2025
Towards Reasoning Ability of Small Language Models
Gaurav Srivastava
Shuxiang Cao
Xuan Wang
ReLM
LRM
149
11
0
17 Feb 2025
Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent
Zeyu He
Saniya Naphade
Ting-Hao 'Kenneth' Huang
81
0
0
16 Feb 2025
Image Embedding Sampling Method for Diverse Captioning
Sania Waheed
Na Min An
91
0
0
14 Feb 2025
Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg
Gen Suzuki
Wei Liu
Murat Sensoy
ALM
139
0
0
07 Feb 2025
Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
Borui Xu
Yao Chen
Zeyi Wen
Weiguo Liu
Bingsheng He
188
2
0
02 Feb 2025
Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy
Tunazzina Islam
Dan Goldwasser
184
3
0
28 Jan 2025
Natural Language Counterfactual Explanations for Graphs Using Large Language Models
Flavio Giorgi
Cesare Campagnano
Fabrizio Silvestri
Gabriele Tolomei
LRM
127
2
0
28 Jan 2025
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
Yinhong Liu
Han Zhou
Zhijiang Guo
Ehsan Shareghi
Ivan Vulić
Anna Korhonen
Nigel Collier
ALM
205
83
0
20 Jan 2025
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
Robert Friel
Masha Belyi
Atindriyo Sanyal
150
28
0
17 Jan 2025
LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Helia Hashemi
J. Eisner
Corby Rosset
Benjamin Van Durme
Chris Kedzie
143
6
0
03 Jan 2025
Geometric-Averaged Preference Optimization for Soft Preference Labels
Hiroki Furuta
Kuang-Huei Lee
Shixiang Shane Gu
Y. Matsuo
Aleksandra Faust
Heiga Zen
Izzeddin Gur
144
13
0
31 Dec 2024