2310.17631
Cited By
v1
v2 (latest)
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
26 October 2023
Lianghui Zhu
Xinggang Wang
Xinlong Wang
ELM
ALM
Papers citing
"JudgeLM: Fine-tuned Large Language Models are Scalable Judges"
50 / 110 papers shown
Title
VLM@school -- Evaluation of AI image understanding on German middle school knowledge
René Peinl
Vincent Tischler
CoGe
VLM
37
0
0
13 Jun 2025
Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation
Tzu-Heng Huang
Harit Vishwakarma
Frederic Sala
ELM
127
0
0
12 Jun 2025
LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
Songze Li
Chuokun Xu
Jiaying Wang
Xueluan Gong
Chen Chen
J. Zhang
Jun Wang
K. Lam
Shouling Ji
AAML
ELM
87
0
0
11 Jun 2025
Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models
Haoyu Wang
Peihao Wang
Mufei Li
Shikun Liu
Siqi Miao
Zhangyang Wang
P. Li
20
0
0
09 Jun 2025
Quantitative LLM Judges
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
...
Jennifer Healey
Nedim Lipka
Ryan Rossi
Uttaran Bhattacharya
Branislav Kveton
ELM
59
0
0
03 Jun 2025
Beyond the Surface: Measuring Self-Preference in LLM Judgments
Zhi-Yuan Chen
Hao Wang
Xinyu Zhang
Enrui Hu
Yankai Lin
63
0
0
03 Jun 2025
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training
Keyeun Lee
Seolhee Lee
Esther Hehsun Kim
Yena Ko
Jinsu Eun
...
Haiyi Zhu
Robert E. Kraut
Eunyoung Suh
Eun-mee Kim
Hajin Lim
24
0
0
31 May 2025
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
Sahana Ramnath
Anurag Mudgil
Brihi Joshi
Skyler Hallinan
Xiang Ren
52
0
0
26 May 2025
Judging with Many Minds: Do More Perspectives Mean Less Prejudice?
Chiyu Ma
Enpei Zhang
Yilun Zhao
Wenjun Liu
Yaning Jia
Peijun Qing
Lin Shi
Arman Cohan
Yujun Yan
Soroush Vosoughi
LLMAG
ELM
60
0
0
26 May 2025
Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
Zhuo Liu
Moxin Li
Xun Deng
Qifan Wang
Fuli Feng
ELM
74
0
0
25 May 2025
Flex-Judge: Think Once, Judge Anywhere
Jongwoo Ko
S. Kim
Sungwoo Cho
Se-Young Yun
ELM
LRM
218
0
0
24 May 2025
ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction
Yan Yu
Yilun Liu
Minggui He
Shimin Tao
Weibin Meng
...
Li Zhang
Hongxia Ma
Chang Su
Hao Yang
Fuliang Li
42
0
0
23 May 2025
Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Jiwon Moon
Yerin Hwang
Dongryeol Lee
Taegwan Kang
Yongil Kim
Kyomin Jung
ELM
59
0
0
22 May 2025
Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
Hongji Yang
Yucheng Zhou
Wencheng Han
Jianbing Shen
29
0
0
22 May 2025
Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge
Yassir Fathullah
Mark Gales
ELM
81
0
0
21 May 2025
Think-J: Learning to Think for Generative LLM-as-a-Judge
Hui Huang
Yancheng He
Hongli Zhou
Rui Zhang
Wei Liu
Weixun Wang
Wenbo Su
Bo Zheng
Jiaheng Liu
LLMAG
AILaw
ELM
LRM
71
1
0
20 May 2025
How Reliable is Multilingual LLM-as-a-Judge?
Xiyan Fu
Wei Liu
ELM
61
0
0
18 May 2025
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Luyu Chen
Zeyu Zhang
Haoran Tan
Quanyu Dai
Hao-ran Yang
Zhenhua Dong
Xu Chen
52
0
0
18 May 2025
Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models
Blanca Calvo Figueras
Rodrigo Agerri
ALM
ELM
LRM
185
2
0
16 May 2025
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
81
2
0
16 May 2025
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
135
0
0
13 May 2025
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Peichao Lai
Kai Zhang
Yi Lin
Lingling Zhang
Feiyang Ye
...
Zifei Shan
Zeang Sheng
Yansen Wang
Wentao Zhang
Bin Cui
ELM
LRM
178
0
0
12 May 2025
To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
Soumik Dey
Hansi Wu
Binbin Li
125
1
0
07 May 2025
am-ELO: A Stable Framework for Arena-based LLM Evaluation
Zirui Liu
Jiatong Li
Yan Zhuang
Qiang Liu
Shuanghong Shen
Jie Ouyang
Mingyue Cheng
Shijin Wang
186
1
0
06 May 2025
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang
Ruotian Ma
Qingxuan Jiang
Peisong Wang
Jiaqi Chen
...
Fanghua Ye
Jian Li
Yifan Yang
Zhaopeng Tu
Xiaolong Li
LLMAG
ELM
ALM
261
0
1
01 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
253
7
0
26 Apr 2025
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
143
9
0
23 Apr 2025
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
Reya Vir
Shreya Shankar
Harrison Chase
Will Fu-Hinthorn
Aditya G. Parameswaran
AI4TS
85
0
0
20 Apr 2025
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Ding Chen
Qingchen Yu
P. Wang
Wentao Zhang
Simin Niu
Feiyu Xiong
Xiaochen Li
Minchuan Yang
Zhiyu Li
ALM
LRM
136
6
0
14 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
101
0
0
10 Apr 2025
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
Riccardo Cantini
A. Orsino
Massimo Ruggiero
Domenico Talia
AAML
ELM
110
4
0
10 Apr 2025
AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Amirhossein Abaskohi
A. Ramesh
Shailesh Nanisetty
Chirag Goel
David Vazquez
Christopher Pal
Spandana Gella
Giuseppe Carenini
I. Laradji
82
0
0
10 Apr 2025
Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
Yifei Yu
Qian Zhang
Lingfeng Qiao
Di Yin
Fang Li
Jie Wang
Zheyu Chen
Suncong Zheng
Xiaolong Liang
Xingwu Sun
95
0
0
07 Apr 2025
Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson
Kevin Du
Niklas Stoehr
Serge Belongie
Ryan Cotterell
Nico Lang
Stella Frank
92
2
0
07 Apr 2025
Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
ELM
LRM
89
3
0
04 Apr 2025
Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing
Johan Wahréus
Ahmed Mohamed Hussain
P. Papadimitratos
105
0
0
27 Mar 2025
A Multi-Model Adaptation of Speculative Decoding for Classification
Somnath Roy
Padharthi Sreekar
Srivatsa Narasimha
Anubhav Anand
80
0
0
23 Mar 2025
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev
Alena Fenogenova
Vladislav Mikhailov
Ekaterina Artemova
111
0
0
17 Mar 2025
GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Mingyang Song
Mao Zheng
Xuan Luo
LRM
105
0
0
08 Mar 2025
Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
Anar Yeginbergen
Maite Oronoz
Rodrigo Agerri
138
0
0
07 Mar 2025
Is Your Video Language Model a Reliable Judge?
M. Liu
Wensheng Zhang
104
5
0
07 Mar 2025
RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Tianjun Wei
Wei Wen
Ruizhi Qiao
Xing Sun
Jianghong Ma
ALM
ELM
75
2
0
07 Mar 2025
Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting
Jiyue Jiang
Pengan Chen
Jinqiao Wang
Dongchen He
Ziqin Wei
...
Yimin Fan
Xiangyu Shi
Jimeng Sun
Chuan Wu
Yuan Li
LM&MA
121
3
0
06 Mar 2025
OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale
Haoyang Li
Shang Wu
Yanling Wang
Xinmei Huang
Jing Zhang
...
Tieying Zhang
Jianjun Chen
Rui Shi
Hong Chen
Cuiping Li
SyDa
159
9
0
04 Mar 2025
Improving LLM-as-a-Judge Inference with the Judgment Distribution
Victor Wang
Michael J.Q. Zhang
Eunsol Choi
114
4
0
04 Mar 2025
Argument Summarization and its Evaluation in the Era of Large Language Models
Moritz Altemeyer
Steffen Eger
Johannes Daxenberger
Yanran Chen
Tim Altendorf
Philipp Cimiano
Benjamin Schiller
LM&MA
ELM
LRM
122
1
0
02 Mar 2025
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu
Tiezheng Yu
Wenjun Hou
Yi Cheng
Liangyou Li
Xin Jiang
Lifeng Shang
Qiang Liu
Wenjie Li
ELM
154
0
0
26 Feb 2025
Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization
Fernando Spadea
Oshani Seneviratne
78
1
0
21 Feb 2025
Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning
Kimia Noorbakhsh
Joseph Chandler
Pantea Karimi
M. Alizadeh
H. Balakrishnan
LRM
109
1
0
18 Feb 2025
Combining Large Language Models with Static Analyzers for Code Review Generation
Imen Jaoua
Oussama Ben Sghaier
Houari Sahraoui
106
2
0
10 Feb 2025