ResearchTrend.AI
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

9 June 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
Yonghao Zhuang
Zi Lin
Zhuohan Li
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
    ALM
    OSLM
    ELM

Papers citing "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

50 / 2,926 papers shown
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
A. Yuille
Jieneng Chen
LRM
65
1
0
01 May 2025
DeepCritic: Deliberate Critique with Large Language Models
Wenkai Yang
Jingwen Chen
Yankai Lin
Ji-Rong Wen
ALM
LRM
35
0
0
01 May 2025
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang
Ruotian Ma
Qingxuan Jiang
Peisong Wang
Jiaqi Chen
...
Fanghua Ye
Jian Li
Yifan Yang
Zhaopeng Tu
Xiaolong Li
LLMAG
ELM
ALM
118
0
1
01 May 2025
Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines
R. Manuvinakurike
Emanuel Moss
E. A. Watkins
Saurav Sahay
G. Raffa
L. Nachman
LRM
31
0
0
01 May 2025
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani
Sachit Menon
Ahmet Iscen
Shyamal Buch
Ramin Mehran
...
Yukun Zhu
Carl Vondrick
Mikhail Sirotenko
Cordelia Schmid
Tobias Weyand
60
0
0
01 May 2025
FineScope : Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation
Chaitali Bhattacharyya
Yeseong Kim
50
0
0
01 May 2025
An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding
Xiuwei Shang
Zhenkan Fu
Shaoyin Cheng
Guoqiang Chen
Gangyang Li
Li Hu
Wenbo Zhang
N. Yu
67
0
0
30 Apr 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
Han Wang
Yuxiao Chen
Qingyun Wu
51
1
0
30 Apr 2025
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Hanhua Hong
Chenghao Xiao
Yang Wang
Y. Liu
Wenge Rong
Chenghua Lin
33
0
0
29 Apr 2025
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Mihai Nadas
Laura Diosan
Andrei Piscoran
Andreea Tomescu
VGen
59
0
0
29 Apr 2025
Computational Reasoning of Large Language Models
Haitao Wu
Zongbo Han
Joey Tianyi Zhou
Huaxi Huang
Changqing Zhang
ELM
LRM
62
0
0
29 Apr 2025
CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks
Rui Wang
Junda Wu
Yu Xia
Tong Yu
Ruiyi Zhang
Ryan Rossi
Lina Yao
Julian McAuley
AAML
SILM
56
0
0
29 Apr 2025
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
Sahel Sharifymoghaddam
Shivani Upadhyay
Nandan Thakur
Ronak Pradeep
Jimmy Lin
RALM
35
0
0
28 Apr 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
84
0
0
28 Apr 2025
AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers
Zijie Lin
Yiqing Shen
Qilin Cai
He Sun
Jinrui Zhou
Mingjun Xiao
66
0
0
28 Apr 2025
$\texttt{SAGE}$: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
59
0
0
28 Apr 2025
Explanatory Summarization with Discourse-Driven Planning
Dongqi Liu
Xi Yu
Vera Demberg
Mirella Lapata
55
0
0
27 Apr 2025
Anyprefer: An Agentic Framework for Preference Data Synthesis
Yiyang Zhou
Zhaoxiang Wang
Tianle Wang
Shangyu Xing
Peng Xia
...
Chetan Bansal
Weitong Zhang
Ying Wei
Joey Tianyi Zhou
Huaxiu Yao
71
1
0
27 Apr 2025
Platonic Grounding for Efficient Multimodal Language Models
Moulik Choraria
Xinbo Wu
Akhil Bhimaraju
Nitesh Sekhar
Yue Wu
Xu Zhang
Prateek Singhal
Lav Varshney
64
0
0
27 Apr 2025
When2Call: When (not) to Call Tools
Hayley Ross
Ameya Sunil Mahabaleshwarkar
Yoshi Suhara
101
0
0
26 Apr 2025
Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning
Abdellah Ghassel
Xianzhi Li
Xiaodan Zhu
58
0
0
26 Apr 2025
MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?
Zheng Hui
Xiaokai Wei
Yexi Jiang
Kevin Gao
Chen Wang
Frank Ong
Se-eun Yoon
Rachit Pareek
Michelle Gong
LLMAG
71
0
0
26 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
98
2
0
26 Apr 2025
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan
Dmitry Namiot
SILM
AAML
ELM
88
0
0
25 Apr 2025
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
37
0
0
25 Apr 2025
A Model Zoo on Phase Transitions in Neural Networks
Konstantin Schurholt
Léo Meynent
Yefan Zhou
Haiquan Lu
Yaoqing Yang
Damian Borth
70
0
0
25 Apr 2025
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
Jing Liu
Hangyu Guo
Ranjie Duan
Xingyuan Bu
Yancheng He
...
Yingshui Tan
Yanan Wu
Jihao Gu
Heng Chang
Jun Zhu
MLLM
241
0
0
25 Apr 2025
CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality
Junyan Zhang
Shuliang Liu
Aiwei Liu
Yubo Gao
Jiajun Li
Xiaojie Gu
Xuming Hu
WaLM
68
2
0
24 Apr 2025
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo
Jinheon Baek
Seongyun Lee
Sung Ju Hwang
AI4CE
44
1
0
24 Apr 2025
A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation
Yangxinyu Xie
Bowen Jiang
Tanwi Mallick
Joshua Bergerson
John K Hutchison
...
Robert B. Ross
Yan Feng
L. Levy
Weijie J. Su
Camillo J Taylor
47
1
0
24 Apr 2025
How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study
Rendi Chevi
Kentaro Inui
Thamar Solorio
Alham Fikri Aji
216
0
0
23 Apr 2025
Planning with Diffusion Models for Target-Oriented Dialogue Systems
Hanwen Du
B. Peng
Xia Ning
30
0
0
23 Apr 2025
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
46
2
0
23 Apr 2025
Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments
Yuante Li
Jama Hussein Mohamud
Chongren Sun
Di Wu
Benoit Boulet
LLMAG
ELM
74
0
0
23 Apr 2025
V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
Zhiyuan Fan
Yumeng Wang
Sandeep Polisetty
Yi R. Fung
52
0
0
23 Apr 2025
Learning Explainable Dense Reward Shapes via Bayesian Optimization
Ryan Koo
Ian Yang
Vipul Raheja
Mingyi Hong
Kwang-Sung Jun
Dongyeop Kang
36
0
0
22 Apr 2025
Certified Mitigation of Worst-Case LLM Copyright Infringement
Jingyu Zhang
Jiacan Yu
Marc Marone
Benjamin Van Durme
Daniel Khashabi
MoMe
248
0
0
22 Apr 2025
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Ning Wang
Zihan Yan
W. Li
Chuan Ma
H. Chen
Tao Xiang
AAML
53
0
0
22 Apr 2025
FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation
Chanyeol Choi
Jihoon Kwon
Jaeseon Ha
Hojun Choi
Chaewoon Kim
Yongjae Lee
Jy-yong Sohn
Alejandro Lopez-Lira
RALM
63
0
0
22 Apr 2025
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement
Zhifan Ye
Kejing Xia
Yonggan Fu
Xin Dong
Jihoon Hong
Xiangchi Yuan
Shizhe Diao
Jan Kautz
Pavlo Molchanov
Yingyan Lin
Mamba
51
4
0
22 Apr 2025
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Yuxin Jiang
Yufei Wang
Chuhan Wu
Xinyi Dai
Yan Xu
...
Yucheng Wang
Xin Jiang
Lifeng Shang
Ruiming Tang
Wenjie Wang
38
0
0
22 Apr 2025
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Minghao Wu
Weixuan Wang
Sinuo Liu
Huifeng Yin
Xintong Wang
Yu Zhao
Chenyang Lyu
Longyue Wang
Weihua Luo
Kaifu Zhang
ELM
81
1
0
22 Apr 2025
Synergistic Weak-Strong Collaboration by Aligning Preferences
Yizhu Jiao
Xuchao Zhang
Zhaoyang Wang
Yubo Ma
Zhun Deng
Rujia Wang
Chetan Bansal
Saravan Rajmohan
Jiawei Han
Huaxiu Yao
226
0
0
21 Apr 2025
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq Joty
ELM
ALM
LRM
58
3
0
21 Apr 2025
Trillion 7B Technical Report
Sungjun Han
Juyoung Suk
Suyeong An
Hyungguk Kim
Kyuseok Kim
Wonsuk Yang
Seungtaek Choi
Jamin Shin
211
1
0
21 Apr 2025
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
Yao Shi
Rongkeng Liang
Yong Xu
LLMAG
AI4Ed
ELM
67
0
0
21 Apr 2025
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
Ronak Pradeep
Nandan Thakur
Shivani Upadhyay
Daniel Fernando Campos
Nick Craswell
Jimmy Lin
38
0
0
21 Apr 2025
Establishing Reliability Metrics for Reward Models in Large Language Models
Yizhou Chen
Yawen Liu
Xuesi Wang
Qingtao Yu
Guangda Huzhang
Anxiang Zeng
Han Yu
Zhiming Zhou
45
0
0
21 Apr 2025
Natural Fingerprints of Large Language Models
Teppei Suzuki
Ryokan Ri
Sho Takase
33
0
0
21 Apr 2025
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa
Zayne Sprague
Chaitanya Malaviya
Philippe Laban
Junyi Jessy Li
Greg Durrett
40
0
0
21 Apr 2025