Validating LLM-as-a-Judge Systems in the Absence of Gold Labels

13 March 2025
Luke M. Guerdan
Solon Barocas
Kenneth Holstein
Hanna M. Wallach
Zhiwei Steven Wu
Alexandra Chouldechova
ALM, ELM
arXiv: 2503.05965

Papers citing "Validating LLM-as-a-Judge Systems in the Absence of Gold Labels"

20 / 20 papers shown
Judging LLMs on a Simplex
Patrick Vossler
Fan Xia
Yifan Mai
Jean Feng
48
0
0
28 May 2025
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
Manav Chaudhary
Harshit Gupta
Savita Bhat
Vasudeva Varma
AAML
93
2
0
12 Dec 2024
Auto-Evaluation with Few Labels through Post-hoc Regression
Benjamin Eyre
David Madras
144
4
0
19 Nov 2024
Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties
Fahim Faisal
Md. Mushfiqur Rahman
Antonios Anastasopoulos
44
4
0
17 Nov 2024
LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?
Rikiya Takehi
E. Voorhees
Tetsuya Sakai
I. Soboroff
245
3
0
11 Nov 2024
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks
Annalisa Szymanski
Noah Ziems
Heather A. Eicher-Miller
Tao Li
Meng Jiang
Ronald A Metoyer
ALM, ELM
88
27
0
26 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM, ALM
118
10
0
17 Oct 2024
Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset
Haoming Lu
Feifei Zhong
66
1
0
12 Oct 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM, ALM
151
65
0
18 Jun 2024
AutoEval Done Right: Using Synthetic Data for Model Evaluation
Pierre Boyeau
Anastasios Nikolas Angelopoulos
N. Yosef
Jitendra Malik
Michael I. Jordan
SyDa
85
22
0
09 Mar 2024
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
...
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
OSLM
160
599
0
07 Mar 2024
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Dongping Chen
Ruoxi Chen
Shilin Zhang
Yinuo Liu
Yaochen Wang
Huichi Zhou
Qihui Zhang
Yao Wan
Pan Zhou
Lichao Sun
ELM
56
123
0
07 Feb 2024
Ragas: Automated Evaluation of Retrieval Augmented Generation
ES Shahul
Jithin James
Luis Espinosa-Anke
Steven Schockaert
143
196
0
26 Sep 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM, OSLM, ELM
458
4,444
0
09 Jun 2023
Prediction-Powered Inference
Anastasios Nikolas Angelopoulos
Stephen Bates
Clara Fannjiang
Michael I. Jordan
Tijana Zrnic
212
103
0
23 Jan 2023
Eliciting and Learning with Soft Labels from Every Annotator
Katherine M. Collins
Umang Bhatt
Adrian Weller
75
46
0
02 Jul 2022
Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future
Jan-Christoph Klie
Bonnie Webber
Iryna Gurevych
96
46
0
05 Jun 2022
Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation
Nitesh Goyal
Ian D Kivlichan
Rachel Rosen
Lucy Vasserman
82
93
0
01 May 2022
What Can We Learn from Collective Human Opinions on Natural Language Inference Data?
Yixin Nie
Xiang Zhou
Joey Tianyi Zhou
93
138
0
07 Oct 2020
Human uncertainty makes classification more robust
Joshua C. Peterson
Ruairidh M. Battleday
Thomas Griffiths
Olga Russakovsky
OOD
64
306
0
19 Aug 2019