2306.05685
Cited By
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
9 June 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
Yonghao Zhuang
Zi Lin
Zhuohan Li
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
Papers citing "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"
50 / 2,926 papers shown
Title
TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Zeqing Wang
Shiyuan Zhang
Chengpei Tang
Keze Wang
LRM
16
0
0
21 May 2025
DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Yuhang Zhou
Jing Zhu
Shengyi Qian
Zhuokai Zhao
Xiyao Wang
Xiaoyu Liu
Ming Li
Paiheng Xu
Wei Ai
Furong Huang
22
0
0
21 May 2025
Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Weixiang Zhao
Xingyu Sui
Yulin Hu
Jiahe Guo
Haixiao Liu
Biye Li
Yanyan Zhao
Bing Qin
Ting Liu
OffRL
21
0
0
21 May 2025
SEPS: A Separability Measure for Robust Unlearning in LLMs
Wonje Jeung
Sangyeon Yoon
Albert No
MU
VLM
20
0
0
20 May 2025
Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs
Jiawen Wang
Pritha Gupta
Ivan Habernal
Eyke Hüllermeier
SILM
AAML
34
0
0
20 May 2025
AutoRev: Automatic Peer Review System for Academic Research Papers
Maitreya Prafulla Chitale
Ketaki Mangesh Shetye
Harshit Gupta
Manav Chaudhary
Vasudeva Varma
7
0
0
20 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
17
0
0
20 May 2025
Think-J: Learning to Think for Generative LLM-as-a-Judge
Hui Huang
Yancheng He
Hongli Zhou
Rui Zhang
Wei Liu
Weixun Wang
Wenbo Su
Bo Zheng
Jiaheng Liu
LLMAG
AILaw
ELM
LRM
17
0
0
20 May 2025
Social Sycophancy: A Broader Understanding of LLM Sycophancy
Myra Cheng
Sunny Yu
Cinoo Lee
Pranav Khadpe
Lujain Ibrahim
Dan Jurafsky
12
0
0
20 May 2025
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Wentao Ma
Weiming Ren
Yiming Jia
Zhuofeng Li
Ping Nie
Ge Zhang
Wenhu Chen
17
0
0
20 May 2025
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs
Darpan Aswal
Siddharth D Jaiswal
AAML
12
0
0
20 May 2025
LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan
Jingwei Ni
Jakob Merane
Etienne Salimbeni
Yang Tian
...
Mrinmaya Sachan
Alexander Stremitzer
Christoph Engel
Elliott Ash
Joel Niklaus
AILaw
ELM
41
0
0
19 May 2025
AMAQA: A Metadata-based QA Dataset for RAG Systems
Davide Bruni
Marco Avvenuti
Nicola Tonellotto
Maurizio Tesconi
7
0
0
19 May 2025
Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability
Jingyi Ren
Yekun Xu
Xiaolong Wang
Weitao Li
Weizhi Ma
Yang Liu
RALM
17
0
0
19 May 2025
Krikri: Advancing Open Large Language Models for Greek
Dimitris Roussis
Leon Voukoutis
Georgios Paraskevopoulos
Sokratis Sofianopoulos
Prokopis Prokopidis
Vassilis Papavasileiou
Athanasios Katsamanis
Stelios Piperidis
Vassilis Katsouros
ALM
30
0
0
19 May 2025
The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting
Christian Braun
Alexander Lilienbeck
Daniel Mentjukov
AILaw
12
0
0
19 May 2025
Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
Narek Maloyan
Bislan Ashinov
Dmitry Namiot
AAML
ELM
19
0
0
19 May 2025
CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs
Guoheng Sun
Ziyao Wang
Bowei Tian
Meng Liu
Zheyu Shen
Shwai He
Yexiao He
Wanghao Ye
Yiting Wang
Ang Li
LRM
17
0
0
19 May 2025
Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization
Sunghwan Kim
Dongjin Kang
Taeyoon Kwon
Hyungjoo Chae
Dongha Lee
Jinyoung Yeo
ALM
12
0
0
19 May 2025
What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts
Chenyang Yang
Y. Shi
Qianou Ma
Michael Xieyang Liu
Christian Kastner
Tongshuang Wu
19
0
0
19 May 2025
MR. Judge: Multimodal Reasoner as a Judge
Renjie Pi
Felix Bai
Qibin Chen
Simon Wang
Jiulong Shan
Kieran Liu
Meng Cao
ELM
LRM
24
0
0
19 May 2025
ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents
Navid Madani
Rohini Srihari
ELM
12
0
0
18 May 2025
Beyond Single-Point Judgment: Distribution Alignment for LLM-as-a-Judge
Luyu Chen
Zeyu Zhang
Haoran Tan
Quanyu Dai
Hao-ran Yang
Zhenhua Dong
Xu Chen
17
0
0
18 May 2025
AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections
Xin Yu
Yujia Wang
Jinghui Chen
Lingzhou Xue
27
0
0
18 May 2025
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches
Yuhang Zhou
Xutian Chen
Yixin Cao
Yuchen Ni
Yu He
...
Xiang Liu
Jian Zhang
Chuanjun Ji
Guangnan Ye
Xipeng Qiu
ELM
17
0
0
18 May 2025
SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment
Wenqiao Zhu
Ji Liu
Lulu Wang
Jun Wu
Yulun Zhang
24
0
0
18 May 2025
Truth Neurons
Haohang Li
Yupeng Cao
Yangyang Yu
Jordan W. Suchow
Zining Zhu
HILM
MILM
KELM
18
0
0
18 May 2025
How Reliable is Multilingual LLM-as-a-Judge?
Xiyan Fu
Wei Liu
ELM
14
0
0
18 May 2025
Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance
Yufeng Wang
Jinwu Hu
Ziteng Huang
Kunyang Lin
Zitian Zhang
...
Zhuliang Yu
Bin Sun
Xiaofen Xing
Qingfang Zheng
Mingkui Tan
7
0
0
18 May 2025
Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu
Shengcai Liu
Jiahao Wu
Weiyu Chen
Zhirui Zhang
Yew-Soon Ong
Qi Wang
Ke Tang
17
0
0
17 May 2025
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
Guijin Son
Jiwoo Hong
Honglu Fan
Heejeong Nam
Hyunwoo Ko
...
Jinyeop Song
Jinha Choi
Gonçalo Paulo
Youngjae Yu
Stella Biderman
22
0
0
17 May 2025
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Zihao Dongfang
Xu Zheng
Ziqiao Weng
Y. Lyu
Danda Pani Paudel
Luc Van Gool
Kailun Yang
Xuming Hu
LRM
19
0
0
17 May 2025
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
X. Zhang
Zetian Ouyang
Linlin Wang
Gerard de Melo
Zhu Cao
Xiaoling Wang
Ya Zhang
Yanfeng Wang
Liang He
LM&MA
ELM
23
0
0
17 May 2025
ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing
Xuanle Zhao
Xuexin Liu
Haoyue Yang
Xianzhen Luo
Fanhu Zeng
Jianling Li
Qi Shi
Chi Chen
17
0
0
17 May 2025
EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation
Ruobing Yao
Yifei Zhang
Shuang Song
Neng Gao
Chenyang Tu
SILM
12
0
0
16 May 2025
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
Khanh-Tung Tran
Barry O'Sullivan
Hoang D. Nguyen
ELM
LRM
9
0
0
16 May 2025
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
Lingxiao Diao
Xinyue Xu
Wanxuan Sun
Cheng Yang
Zhuosheng Zhang
LLMAG
ALM
ELM
17
0
0
16 May 2025
Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
Zhan Peng Lee
Andre Lin
Calvin Tan
RALM
HILM
37
0
0
16 May 2025
BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang
Yekyung Kim
Michael Krumdick
Amir Zadeh
Chuan Li
Chris Tanner
Mohit Iyyer
ALM
24
0
0
16 May 2025
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training
Feijiang Han
Xiaodong Yu
Jianheng Tang
Lyle Ungar
12
0
0
16 May 2025
Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition
Bo Yue
Shuqi Guo
Kaiyu Hu
Chujiao Wang
Benyou Wang
Kui Jia
Guiliang Liu
LRM
32
0
0
16 May 2025
CAMEO: Collection of Multilingual Emotional Speech Corpora
Iwona Christop
Maciej Czajka
26
0
0
16 May 2025
Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models
Jian Wu
Cong Wang
TianHuang Su
Jun Yang
Haozhi Lin
...
Steve Yang
BinQing Pan
Zhiyu Li
Ni Yang
ZhenYu Yang
ALM
21
0
0
16 May 2025
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Ziyi Wang
Jiaqi Zeng
Olivier Delalleau
Hoo-Chang Shin
Felipe Soares
Alexander Bukharin
Ellie Evans
Yi Dong
Oleksii Kuchaiev
29
0
0
16 May 2025
THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering
Udita Patel
Rutu Mulkar
Jay Roberts
Cibi Chakravarthy Senthilkumar
Sujay Gandhi
Xiaofei Zheng
Naumaan Nayyar
Rafael Castrillo
12
0
0
16 May 2025
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
30
0
0
16 May 2025
RanDeS: Randomized Delta Superposition for Multi-Model Compression
Hangyu Zhou
Aaron Gokaslan
Volodymyr Kuleshov
Bharath Hariharan
MoMe
32
0
0
16 May 2025
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models
Xiaomin Li
Mingye Gao
Yuexing Hao
Taoran Li
Guangya Wan
Zihan Wang
Yijun Wang
LM&MA
ELM
AI4MH
29
0
0
16 May 2025
REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
Pawin Taechoyotin
Daniel Acuna
LRM
17
0
0
16 May 2025
XRAG: Cross-lingual Retrieval-Augmented Generation
Wei Liu
Sony Trenous
Leonardo F. R. Ribeiro
Bill Byrne
Felix Hieber
RALM
31
0
0
15 May 2025