Cited By

Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
arXiv:2503.05965 (13 March 2025)
Luke M. Guerdan, Solon Barocas, Kenneth Holstein, Hanna M. Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Tags: ALM, ELM
Papers citing "Validating LLM-as-a-Judge Systems in the Absence of Gold Labels" (20 papers shown)

Judging LLMs on a Simplex. Patrick Vossler, Fan Xia, Yifan Mai, Jean Feng. 28 May 2025.
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations. Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma. 12 Dec 2024. Tags: AAML.
Auto-Evaluation with Few Labels through Post-hoc Regression. Benjamin Eyre, David Madras. 19 Nov 2024.
Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties. Fahim Faisal, Md. Mushfiqur Rahman, Antonios Anastasopoulos. 17 Nov 2024.
LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? Rikiya Takehi, E. Voorhees, Tetsuya Sakai, I. Soboroff. 11 Nov 2024.
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks. Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Tao Li, Meng Jiang, Ronald A Metoyer. 26 Oct 2024. Tags: ALM, ELM.
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data. Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt. 17 Oct 2024. Tags: ELM, ALM.
Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset. Haoming Lu, Feifei Zhong. 12 Oct 2024.
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes. 18 Jun 2024. Tags: ELM, ALM.
AutoEval Done Right: Using Synthetic Data for Model Evaluation. Pierre Boyeau, Anastasios Nikolas Angelopoulos, N. Yosef, Jitendra Malik, Michael I. Jordan. 09 Mar 2024. Tags: SyDa.
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, ..., Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, Ion Stoica. 07 Mar 2024. Tags: OSLM.
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun. 07 Feb 2024. Tags: ELM.
Ragas: Automated Evaluation of Retrieval Augmented Generation. ES Shahul, Jithin James, Luis Espinosa-Anke, Steven Schockaert. 26 Sep 2023.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica. 09 Jun 2023. Tags: ALM, OSLM, ELM.
Prediction-Powered Inference. Anastasios Nikolas Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, Tijana Zrnic. 23 Jan 2023.
Eliciting and Learning with Soft Labels from Every Annotator. Katherine M. Collins, Umang Bhatt, Adrian Weller. 02 Jul 2022.
Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. Jan-Christoph Klie, Bonnie Webber, Iryna Gurevych. 05 Jun 2022.
Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation. Nitesh Goyal, Ian D Kivlichan, Rachel Rosen, Lucy Vasserman. 01 May 2022.
What Can We Learn from Collective Human Opinions on Natural Language Inference Data? Yixin Nie, Xiang Zhou, Joey Tianyi Zhou. 07 Oct 2020.
Human uncertainty makes classification more robust. Joshua C. Peterson, Ruairidh M. Battleday, Thomas Griffiths, Olga Russakovsky. 19 Aug 2019. Tags: OOD.