arXiv:2404.18796
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
29 April 2024
Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
ALM · ELM

Papers citing "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" (50 of 67 papers shown)

Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Dylan Bouchard, Mohit Singh Chauhan
HILM · 84 · 0 · 0 · 27 Apr 2025

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
Y. Li, Jama Hussein Mohamud, Chongren Sun, Di Wu, Benoit Boulet
LLMAG · ELM · 72 · 0 · 0 · 23 Apr 2025

PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran
AI4TS · 32 · 0 · 0 · 20 Apr 2025

Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization
Adithya Pratapa, Teruko Mitamura
RALM · 34 · 0 · 0 · 17 Apr 2025

Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Xiaotian Zhang, Ruizhe Chen, Yang Feng, Zuozhu Liu
42 · 0 · 0 · 17 Apr 2025

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa
ELM · 36 · 0 · 0 · 16 Apr 2025

Benchmarking Vision Language Models on German Factual Data
René Peinl, Vincent Tischler
CoGe · 69 · 0 · 0 · 15 Apr 2025

Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification
Joseph Liu, Yoonsoo Nam, Xinyue Cui, Swabha Swayamdipta
56 · 0 · 0 · 13 Apr 2025

TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah, Ali Emami, Hassan Sajjad
LLMAG · ELM · 45 · 0 · 0 · 10 Apr 2025

Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José P. Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins
ALM · 75 · 0 · 0 · 01 Apr 2025

Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs
Cong Duy Vu Hoang, Gioacchino Tangari, Clemence Lanfranchi, Dalu Guo, Paul Cayet, ..., Long Duong, Damien Hilloulin, Rhicheek Patra, Sungpack Hong, Hassan Chafi
33 · 0 · 0 · 30 Mar 2025

Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya, Yinhong Liu, Ramit Debnath, Anna Korhonen
37 · 0 · 0 · 22 Mar 2025

CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang
LRM · 51 · 0 · 0 · 21 Mar 2025

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, Chenyu You
ELM · 95 · 3 · 0 · 19 Mar 2025

Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang
63 · 0 · 0 · 18 Mar 2025

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen, Seraphina Goldfarb-Tarrant
50 · 0 · 0 · 12 Mar 2025

DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
Sher Badshah, Hassan Sajjad
65 · 1 · 0 · 11 Mar 2025

No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner
ALM · ELM · 55 · 2 · 0 · 07 Mar 2025

Learning and generalization of robotic dual-arm manipulation of boxes from demonstrations via Gaussian Mixture Models (GMMs)
Qian Ying Lee, Suhas Raghavendra Kulkarni, Kenzhi Iskandar Wong, Lin Yang, Bernardo Noronha, Yongjun Wee, Tzu-Yi Hung, Domenico Campolo
53 · 0 · 0 · 07 Mar 2025

How Do Hackathons Foster Creativity? Towards AI Collaborative Evaluation of Creativity at Scale
Jeanette Falk, Yiyi Chen, Janet Rafner, Mike Zhang, Johannes Bjerva, Alexander Nolte
63 · 1 · 0 · 06 Mar 2025

The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
Marthe Ballon, Andres Algaba, Vincent Ginis
LRM · ReLM · 44 · 5 · 0 · 24 Feb 2025

Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility
Martin Kuo, Jingyang Zhang, Jianyi Zhang, Minxue Tang, Louis DiValentin, ..., William Chen, Amin Hass, Tianlong Chen, Yuxiao Chen, Yiming Li
MU · KELM · 51 · 2 · 0 · 24 Feb 2025

Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy
ALM · 82 · 0 · 0 · 07 Feb 2025

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
Robert Friel, Masha Belyi, Atindriyo Sanyal
82 · 19 · 0 · 17 Jan 2025

Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models
Hao Li, C. Bezemer, Ahmed E. Hassan
45 · 2 · 0 · 08 Jan 2025

LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context
Kai Ruan, Xuan Wang, Jixiang Hong, Hao Sun, Yang Liu, Hao Sun
LRM · ELM · 41 · 2 · 0 · 23 Dec 2024

QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan, Neemesh Yadav, Sarah Masud, Md. Shad Akhtar
74 · 0 · 0 · 16 Dec 2024

AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
Fabrizio Davide, Pietro Torre, Andrea Gaggioli
ELM · 157 · 0 · 0 · 12 Dec 2024

Engagement-Driven Content Generation with Large Language Models
Simone Mungari, Federico Cinus, Marco Minici, Francesco Bonchi, Giuseppe Manco
77 · 0 · 0 · 20 Nov 2024

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing
RALM · VLM · 48 · 4 · 0 · 09 Nov 2024

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran
34 · 6 · 0 · 29 Oct 2024

JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca A. Popa, Ion Stoica
ELM · ALM · 53 · 38 · 0 · 16 Oct 2024

Black-box Uncertainty Quantification Method for LLM-as-a-Judge
Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martin Santillan Cooper, James M. Johnson, Werner Geyer
ELM · UQCV · 49 · 4 · 0 · 15 Oct 2024

SkillAggregation: Reference-free LLM-Dependent Aggregation
Guangzhi Sun, Anmol Kagrecha, Potsawee Manakul, Phil Woodland, Mark J. F. Gales
32 · 0 · 0 · 14 Oct 2024

PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency
Preferred Elements: Kenshin Abe, Kaizaburo Chubachi, Yasuhiro Fujita, ..., Yoshihiko Ozaki, Shotaro Sano, Shuji Suzuki, Tianqi Xu, Toshihiko Yanase
41 · 0 · 0 · 10 Oct 2024

LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
56 · 8 · 0 · 09 Oct 2024

EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?
Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang
47 · 4 · 0 · 06 Oct 2024

Can Language Models Reason about Individualistic Human Values and Preferences?
Liwei Jiang, Taylor Sorensen, Sydney Levine, Yejin Choi
36 · 7 · 0 · 04 Oct 2024

TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation
Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, Alex Wang
ALM · 37 · 10 · 0 · 04 Oct 2024

AIME: AI System Optimization via Multiple LLM Evaluators
Bhrij Patel, Souradip Chakraborty, Wesley A Suttle, Mengdi Wang, Amrit Singh Bedi, Dinesh Manocha
29 · 8 · 0 · 04 Oct 2024

What Would You Ask When You First Saw a^2+b^2=c^2? Evaluating LLM on Curiosity-Driven Questioning
Shashidhar Reddy Javaji, Zining Zhu
ELM · ALM · 39 · 0 · 0 · 19 Sep 2024

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong
ALM · ELM · 54 · 7 · 0 · 17 Sep 2024

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG · 58 · 3 · 0 · 10 Sep 2024

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
Sher Badshah, Hassan Sajjad
ELM · 40 · 9 · 0 · 17 Aug 2024

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakker
ALM · ELM · 42 · 11 · 0 · 16 Aug 2024

The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
Samee Arif, Sualeha Farid, Abdul Hameed Azeemi, Awais Athar, Agha Ali Raza
LLMAG · 24 · 7 · 0 · 16 Aug 2024

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
ALM · 44 · 12 · 0 · 25 Jul 2024

Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
Seungyoon Kim, Seungone Kim
AI4Ed · 34 · 0 · 0 · 24 Jul 2024

On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, ..., Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
ELM · 43 · 29 · 0 · 05 Jul 2024

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, ..., Chee Wei Tan, Md. Rizwan Parvez, Enamul Hoque, Chenyu You, Jimmy Huang
ELM · ALM · 31 · 28 · 0 · 04 Jul 2024