Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

9 June 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
Yonghao Zhuang
Zi Lin
Zhuohan Li
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
    ALM
    OSLM
    ELM

Papers citing "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

40 / 2,990 papers shown
KoLA: Carefully Benchmarking World Knowledge of Large Language Models
Jifan Yu
Xiaozhi Wang
Shangqing Tu
S. Cao
Daniel Zhang-Li
...
Lei Hou
Zhiyuan Liu
Bin Xu
Jie Tang
Juanzi Li
ELM
ALM
41
67
0
15 Jun 2023
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Peng Xu
Wenqi Shao
Kaipeng Zhang
Peng Gao
Shuo Liu
Meng Lei
Fanqing Meng
Siyuan Huang
Yu Qiao
Ping Luo
ELM
MLLM
44
159
0
15 Jun 2023
MiniLLM: Knowledge Distillation of Large Language Models
Yuxian Gu
Li Dong
Furu Wei
Minlie Huang
ALM
58
77
0
14 Jun 2023
Model Spider: Learning to Rank Pre-Trained Models Efficiently
Yi-Kai Zhang
Ting Huang
Yao-Xiang Ding
De-Chuan Zhan
Han-Jia Ye
41
25
0
06 Jun 2023
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Dongfu Jiang
Xiang Ren
Bill Yuchen Lin
ELM
24
283
0
05 Jun 2023
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Shalev Lifshitz
Keiran Paster
Harris Chan
Jimmy Ba
Sheila A. McIlraith
LM&Ro
45
68
0
01 Jun 2023
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Chunyuan Li
Cliff Wong
Sheng Zhang
Naoto Usuyama
Haotian Liu
Jianwei Yang
Tristan Naumann
Hoifung Poon
Jianfeng Gao
LM&MA
MedIm
65
714
0
01 Jun 2023
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
Q. V. Liao
Ziang Xiao
ALM
ELM
68
30
0
01 Jun 2023
Large Language Models are not Fair Evaluators
Peiyi Wang
Lei Li
Liang Chen
Zefan Cai
Dawei Zhu
Binghuai Lin
Yunbo Cao
Qi Liu
Tianyu Liu
Zhifang Sui
ALM
56
527
0
29 May 2023
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu
Barlas Oğuz
Changsheng Zhao
Ernie Chang
Pierre Stock
Yashar Mehdad
Yangyang Shi
Raghuraman Krishnamoorthi
Vikas Chandra
MQ
65
194
0
29 May 2023
Lawyer LLaMA Technical Report
Quzhe Huang
Mingxu Tao
Chen Zhang
Zhenwei An
Cong Jiang
Zhibin Chen
Zirui Wu
Yansong Feng
ELM
ALM
AILaw
58
50
0
24 May 2023
In-Context Impersonation Reveals Large Language Models' Strengths and Biases
Leonard Salewski
Stephan Alaniz
Isabel Rio-Torto
Eric Schulz
Zeynep Akata
49
151
0
24 May 2023
Automatic Model Selection with Large Language Models for Reasoning
Xu Zhao
Yuxi Xie
Kenji Kawaguchi
Junxian He
Qizhe Xie
ReLM
LRM
42
39
0
23 May 2023
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Da Yin
Xiao Liu
Fan Yin
Ming Zhong
Hritik Bansal
Jiawei Han
Kai-Wei Chang
ALM
42
37
0
23 May 2023
QTSumm: Query-Focused Summarization over Tabular Data
Yilun Zhao
Zhenting Qi
Linyong Nan
Boyu Mi
Yixin Liu
...
Ruizhe Chen
Xiangru Tang
Yumo Xu
Dragomir R. Radev
Arman Cohan
RALM
LMTD
46
1
0
23 May 2023
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
Wenda Xu
Danqing Wang
Liangming Pan
Zhenqiao Song
Markus Freitag
Wenjie Wang
Lei Li
ALM
ELM
46
18
0
23 May 2023
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
Rui Wang
Hongru Wang
Fei Mi
Yi Chen
Boyang Xue
Kam-Fai Wong
Rui-Lan Xu
53
13
0
23 May 2023
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts
Jian Xie
Kai Zhang
Jiangjie Chen
Renze Lou
Yu-Chuan Su
RALM
227
159
0
22 May 2023
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Yann Dubois
Xuechen Li
Rohan Taori
Tianyi Zhang
Ishaan Gulrajani
Jimmy Ba
Carlos Guestrin
Percy Liang
Tatsunori B. Hashimoto
ALM
53
556
0
22 May 2023
CLASS: A Design Framework for building Intelligent Tutoring Systems based on Learning Science principles
Shashank Sonkar
Lucy Liu
D. B. Mallick
Richard G. Baraniuk
80
39
0
22 May 2023
DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection
Xiao Yu
Yuang Qi
Kejiang Chen
Guoqiang Chen
Xi Yang
Pengyuan Zhu
Xiuwei Shang
Weiming Zhang
Neng H. Yu
DeLMO
28
11
0
21 May 2023
Evaluating the Performance of Large Language Models on GAOKAO Benchmark
Xiaotian Zhang
Chun-yan Li
Yi Zong
Zhengyu Ying
Liang He
Xipeng Qiu
ALM
ELM
32
99
0
21 May 2023
InstructIE: A Bilingual Instruction-based Information Extraction Dataset
Honghao Gui
Shuofei Qiao
Jintian Zhang
Hongbin Ye
Mengshu Sun
Lei Liang
Jeff Z. Pan
Huajun Chen
Ningyu Zhang
39
7
0
19 May 2023
Automatic Evaluation of Attribution by Large Language Models
Xiang Yue
Boshi Wang
Ziru Chen
Kai Zhang
Yu-Chuan Su
Huan Sun
ALM
LRM
HILM
43
55
0
10 May 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
229
581
0
03 May 2023
A Comprehensive Evaluation of Neural SPARQL Query Generation from Natural Language Questions
Papa Abdou Karim Karou Diallo
Samuel Reyd
Amal Zouaq
16
7
0
16 Apr 2023
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Haoran Li
Dadi Guo
Wei Fan
Mingshi Xu
Jie Huang
Fanpu Meng
Yangqiu Song
SILM
75
327
0
11 Apr 2023
Instruction Tuning with GPT-4
Baolin Peng
Chunyuan Li
Pengcheng He
Michel Galley
Jianfeng Gao
SyDa
ALM
LM&MA
171
591
0
06 Apr 2023
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu
Dan Iter
Yichong Xu
Shuohang Wang
Ruochen Xu
Chenguang Zhu
ELM
ALM
LM&MA
98
1,108
0
29 Mar 2023
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models
Qingyu Lu
Baopu Qiu
Liang Ding
Liping Xie
Tom Kocmi
Dacheng Tao
LRM
ALM
ELM
31
109
0
24 Mar 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
441
3,045
0
22 Mar 2023
SAINE: Scientific Annotation and Inference Engine of Scientific Research
Susie Xi Rao
Yi-Lin Tu
P. Egger
32
1
0
28 Feb 2023
Guiding Large Language Models via Directional Stimulus Prompting
Zekun Li
Baolin Peng
Pengcheng He
Michel Galley
Jianfeng Gao
Xi Yan
LLMAG
LRM
LM&Ro
45
96
0
22 Feb 2023
Using In-Context Learning to Improve Dialogue Safety
Nicholas Meade
Spandana Gella
Devamanyu Hazarika
Prakhar Gupta
Di Jin
Siva Reddy
Yang Liu
Dilek Z. Hakkani-Tür
41
38
0
02 Feb 2023
Quality at the Tail of Machine Learning Inference
Zhengxin Yang
Wanling Gao
Chunjie Luo
Lei Wang
Fei Tang
Xu Wen
Jianfeng Zhan
40
1
0
25 Dec 2022
Ontologically Faithful Generation of Non-Player Character Dialogues
Nathaniel Weir
Ryan Thomas
Randolph D'Amore
Kellie Hill
Benjamin Van Durme
Harsh Jhamtani
36
6
0
20 Dec 2022
Defending Against Disinformation Attacks in Open-Domain Question Answering
Orion Weller
Aleem Khan
Nathaniel Weir
Dawn J Lawrie
Benjamin Van Durme
AAML
76
4
0
20 Dec 2022
Can large language models reason about medical questions?
Valentin Liévin
C. Hother
Andreas Geert Motzfeldt
Ole Winther
ELM
LM&MA
AI4MH
LRM
62
301
0
17 Jul 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
454
12,178
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
452
8,727
0
28 Jan 2022