Cited By

Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
arXiv:2503.05965 (13 March 2025)
Luke M. Guerdan, Solon Barocas, Kenneth Holstein, Hanna M. Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Tags: ALM, ELM
Papers citing "Validating LLM-as-a-Judge Systems in the Absence of Gold Labels" (20 papers shown)

Judging LLMs on a Simplex. Patrick Vossler, Fan Xia, Yifan Mai, Jean Feng. 28 May 2025.
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations. Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma. 12 Dec 2024. Tags: AAML.
Auto-Evaluation with Few Labels through Post-hoc Regression. Benjamin Eyre, David Madras. 19 Nov 2024.
Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties. Fahim Faisal, Md. Mushfiqur Rahman, Antonios Anastasopoulos. 17 Nov 2024.
LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? Rikiya Takehi, E. Voorhees, Tetsuya Sakai, I. Soboroff. 11 Nov 2024.
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks. Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Tao Li, Meng Jiang, Ronald A Metoyer. 26 Oct 2024. Tags: ALM, ELM.
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data. Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt. 17 Oct 2024. Tags: ELM, ALM.
Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset. Haoming Lu, Feifei Zhong. 12 Oct 2024.
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes. 18 Jun 2024. Tags: ELM, ALM.
AutoEval Done Right: Using Synthetic Data for Model Evaluation. Pierre Boyeau, Anastasios Nikolas Angelopoulos, N. Yosef, Jitendra Malik, Michael I. Jordan. 09 Mar 2024. Tags: SyDa.
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, ..., Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, Ion Stoica. 07 Mar 2024. Tags: OSLM.
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun. 07 Feb 2024. Tags: ELM.
Ragas: Automated Evaluation of Retrieval Augmented Generation. ES Shahul, Jithin James, Luis Espinosa-Anke, Steven Schockaert. 26 Sep 2023.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica. 09 Jun 2023. Tags: ALM, OSLM, ELM.
Prediction-Powered Inference. Anastasios Nikolas Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, Tijana Zrnic. 23 Jan 2023.
Eliciting and Learning with Soft Labels from Every Annotator. Katherine M. Collins, Umang Bhatt, Adrian Weller. 02 Jul 2022.
Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. Jan-Christoph Klie, Bonnie Webber, Iryna Gurevych. 05 Jun 2022.
Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation. Nitesh Goyal, Ian D Kivlichan, Rachel Rosen, Lucy Vasserman. 01 May 2022.
What Can We Learn from Collective Human Opinions on Natural Language Inference Data? Yixin Nie, Xiang Zhou, Joey Tianyi Zhou. 07 Oct 2020.
Human uncertainty makes classification more robust. Joshua C. Peterson, Ruairidh M. Battleday, Thomas Griffiths, Olga Russakovsky. 19 Aug 2019. Tags: OOD.