From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

25 November 2024
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
Zhen Tan
Amrita Bhattacharjee
Yuxuan Jiang
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM, AILaw

Papers citing "From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge"

50 / 95 papers shown
Title
AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions
Ihor Pysmennyi
Roman Kyslyi
Kyrylo Kleshch
5
0
0
19 Jun 2025
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
Lucile Favero
Daniel Frases
Juan Antonio Pérez-Ortiz
Tanja Kaser
Nuria Oliver
ELM, LRM
20
0
0
17 Jun 2025
A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis
Bruno Martins
Piotr Szymański
Piotr Gramacki
30
0
0
17 Jun 2025
BOW: Bottlenecked Next Word Exploration
Ming Shen
Zhikun Xu
Xiao Ye
Jacob Dineen
Ben Zhou
OffRL, LRM
35
0
0
16 Jun 2025
A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis
Hui Wei
Dong Yoon Lee
Shubham Rohal
Zhizhang Hu
Shiwei Fang
Shijia Pan
40
0
0
13 Jun 2025
TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review
Yuan Chang
Ziyue Li
Hengyuan Zhang
Yuanbo Kong
Yanru Wu
Zhijiang Guo
Ngai Wong
40
1
0
09 Jun 2025
BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions
Saptarshi Sengupta
Shuhua Yang
Paul Kwong Yu
Fali Wang
Suhang Wang
56
0
0
06 Jun 2025
RewardAnything: Generalizable Principle-Following Reward Models
Zhuohao Yu
Jiali Zeng
Weizheng Gu
Yidong Wang
Jindong Wang
Fandong Meng
Jie Zhou
Yue Zhang
Shikun Zhang
Wei Ye
LRM
123
1
0
04 Jun 2025
Beyond the Surface: Measuring Self-Preference in LLM Judgments
Zhi-Yuan Chen
Hao Wang
Xinyu Zhang
Enrui Hu
Yankai Lin
65
0
0
03 Jun 2025
Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs
Manon Reusens
Bart Baesens
David Jurgens
69
0
0
03 Jun 2025
DPO Learning with LLMs-Judge Signal for Computer Use Agents
Man Luo
David Cobbley
Xin Su
Shachar Rosenman
Vasudev Lal
Shao-Yen Tseng
Phillip Howard
51
0
0
03 Jun 2025
LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment
Lingyao Li
Dawei Li
Zhenhui Ou
Xiaoran Xu
Jingxiao Liu
Zihui Ma
Runlong Yu
Min Deng
AI4CE
26
0
0
02 Jun 2025
LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
Shuzhou Yuan
Ercong Nie
Lukas Kouba
Ashish Yashwanth Kangen
Helmut Schmid
Hinrich Schütze
Michael Färber
75
0
0
02 Jun 2025
Judging LLMs on a Simplex
Patrick Vossler
Fan Xia
Yifan Mai
Jean Feng
65
0
0
28 May 2025
A Large Language Model-Enabled Control Architecture for Dynamic Resource Capability Exploration in Multi-Agent Manufacturing Systems
Jonghan Lim
Ilya Kovalenko
AI4CE
47
0
0
28 May 2025
Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement
Keheliya Gallaba
Ali Arabat
Dayi Lin
Mohammed Sayagh
Ahmed E. Hassan
AI4CE
67
0
0
27 May 2025
The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
Jiaming Ji
Sitong Fang
Wenjing Cao
Jiahao Li
Xuyao Wang
Juntao Dai
Chi-Min Chan
Sirui Han
Yike Guo
Yaodong Yang
LRM
37
0
0
26 May 2025
CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Guang Yang
Yu Zhou
Xiang Chen
Wei-Shi Zheng
Xing Hu
Xin Zhou
David Lo
Taolue Chen
ALM, LRM
90
0
0
26 May 2025
DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Pingzhi Li
Zhen Tan
Huaizhi Qu
Huan Liu
Tianlong Chen
AAML
54
0
0
26 May 2025
Multi-Domain Explainability of Preferences
Nitay Calderon
Liat Ein-Dor
Roi Reichart
LRM
58
0
0
26 May 2025
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
Sahana Ramnath
Anurag Mudgil
Brihi Joshi
Skyler Hallinan
Xiang Ren
52
0
0
26 May 2025
Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
Qian Cao
Xiting Wang
Yuzhuo Yuan
Yahui Liu
Fang Luo
Ruihua Song
53
0
0
25 May 2025
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
Ruichen Zhang
Rana Muhammad Shahroz Khan
Zhen Tan
Dawei Li
Song Wang
Tianlong Chen
LRM
65
0
0
24 May 2025
Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts
Seon Gyeom Kim
Jae Young Choi
Ryan Rossi
Eunyee Koh
Tak Yeon Lee
123
0
0
23 May 2025
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Licheng Pan
Yongqi Tong
Xin Zhang
Xiaolu Zhang
Jun Zhou
Zhixuan Chu
59
0
0
23 May 2025
MuseRAG: Idea Originality Scoring At Scale
Ali Sarosh Bangash
Krish Veera
Ishfat Abrar Islam
Raiyan Abdul Baten
LRM
64
0
0
22 May 2025
ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models
Razvan-Gabriel Dumitru
Darius Peteleaza
Vikas Yadav
Liangming Pan
ReLM, LRM
115
1
0
22 May 2025
Can Large Language Models Understand Internet Buzzwords Through User-Generated Content
Chen Huang
Junkai Luo
Xinzuo Wang
Wenqiang Lei
Jiancheng Lv
86
0
0
21 May 2025
DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
Yuxuan Jiang
Dawei Li
Frank Ferraro
LRM
175
1
0
20 May 2025
How Reliable is Multilingual LLM-as-a-Judge?
Xiyan Fu
Wei Liu
ELM
61
0
0
18 May 2025
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches
Yuhang Zhou
Xutian Chen
Yixin Cao
Yuchen Ni
Yu He
...
Xiang Liu
Jian Zhang
Chuanjun Ji
Guangnan Ye
Xipeng Qiu
ELM
61
0
0
18 May 2025
Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
Zhan Peng Lee
Andre Lin
Calvin Tan
RALM, HILM
87
0
0
16 May 2025
"There Is No Such Thing as a Dumb Question," But There Are Good Ones
Minjung Shin
Donghyun Kim
Jeh-Kwang Ryu
ELM
66
0
0
15 May 2025
LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
Takumi Shibata
Yuichi Miyamura
127
0
0
13 May 2025
SymPlanner: Deliberate Planning in Language Models with Symbolic Representation
Siheng Xiong
Jieyu Zhou
Zhangding Liu
Yusen Su
LLMAG, LM&Ro
459
0
0
02 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
Hongru Wang
Yuxiao Chen
Qingyun Wu
197
7
0
30 Apr 2025
Agree to Disagree? A Meta-Evaluation of LLM Misgendering
Arjun Subramonian
Vagrant Gautam
Preethi Seshadri
Dietrich Klakow
Kai-Wei Chang
Ningyu Zhang
97
1
0
23 Apr 2025
Benchmarking LLM-based Relevance Judgment Methods
Negar Arabzadeh
Charles L. A. Clarke
100
0
0
17 Apr 2025
Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer
Huaizhi Qu
Inyoung Choi
Zhen Tan
Song Wang
Sukwon Yun
Qi Long
Faizan Siddiqui
Kwonjoon Lee
Tianlong Chen
78
0
0
17 Apr 2025
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Xiaotian Zhang
Ruizhe Chen
Yang Feng
Zuozhu Liu
109
2
0
17 Apr 2025
A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment
Negar Arabzadeh
Charles L. A. Clarke
82
3
0
16 Apr 2025
Deep Reasoning Translation via Reinforcement Learning
Jiaan Wang
Fandong Meng
Jie Zhou
OffRL, LRM
123
1
0
14 Apr 2025
Heimdall: test-time scaling on the generative verification
Wenlei Shi
Xing Jin
LRM
131
7
0
14 Apr 2025
Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games
Andrés Isaza-Giraldo
Paulo Bala
Lucas Pereira
84
0
0
13 Apr 2025
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
Zongxian Yang
Jiayu Qian
Z. Huang
Kay Chen Tan
LM&MA, LRM
162
0
0
13 Apr 2025
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Vladislav Mikhailov
Tita Ranveig Enstad
David Samuel
Hans Christian Farsethås
Andrey Kutuzov
Erik Velldal
Lilja Øvrelid
ELM
115
1
0
10 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li
Hanchen Li
Chenhao Tan
ALM, ELM
132
0
0
09 Apr 2025
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Liangjie Huang
Dawei Li
Huan Liu
Lu Cheng
LRM
116
0
0
03 Apr 2025
A Survey of Scaling in Large Language Model Reasoning
Zihan Chen
Song Wang
Zhen Tan
Xingbo Fu
Zhenyu Lei
Peng Wang
Huan Liu
Cong Shen
Jundong Li
LRM
241
2
0
02 Apr 2025
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao
Ilias Driouich
Robin Singh
Eoin Thomas
ELM
97
0
0
01 Apr 2025