2403.04132
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
7 March 2024
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
Dacheng Li
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
OSLM
Papers citing
"Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference"
50 / 340 papers shown
Title
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory
Hongli Zhou
Hui Huang
Ziqing Zhao
Lvyuan Han
Huicheng Wang
...
Jian Dong
Bing Xu
Conghui Zhu
Hailong Cao
Tiejun Zhao
ALM
12
0
0
21 May 2025
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Yu Ying Chiu
Zhilin Wang
Sharan Maiya
Yejin Choi
Kyle Fish
Sydney Levine
Evan Hubinger
17
0
0
20 May 2025
LLM-based Query Expansion Fails for Unfamiliar and Ambiguous Queries
Kenya Abe
Kunihiro Takeoka
Makoto P. Kato
Masafumi Oyamada
28
0
0
19 May 2025
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning
Debarpan Bhattacharya
Apoorva Kulkarni
Sriram Ganapathy
22
0
0
19 May 2025
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
Pedro M. P. Curvo
LLMAG
26
0
0
19 May 2025
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches
Yuhang Zhou
Xutian Chen
Yixin Cao
Yuchen Ni
Yu He
...
Xiang Liu
Jian Zhang
Chuanjun Ji
Guangnan Ye
Xipeng Qiu
ELM
24
0
0
18 May 2025
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking
Changlun Li
Yao Shi
Chen Wang
Qiqi Duan
Runke Ruan
Weijie Huang
Haonan Long
Lijun Huang
Yuyu Luo
Nan Tang
AIFin
45
0
0
16 May 2025
REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
Pawin Taechoyotin
Daniel Acuna
LRM
38
0
0
16 May 2025
Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models
Jian Wu
Cong Wang
TianHuang Su
Jun Yang
Haozhi Lin
...
Steve Yang
BinQing Pan
Zehan Li
Ni Yang
ZhenYu Yang
ALM
21
0
0
16 May 2025
WorldPM: Scaling Human Preference Modeling
Binghai Wang
Runji Lin
Keming Lu
Le Yu
Zizhuo Zhang
...
Xuanjing Huang
Yu-Gang Jiang
Bowen Yu
Jingren Zhou
Junyang Lin
48
0
0
15 May 2025
Evaluations at Work: Measuring the Capabilities of GenAI in Use
Brandon Lepine
Gawesha Weerantunga
Juho Kim
Pamela Mishkin
Matthew Beane
24
0
0
15 May 2025
Empirically evaluating commonsense intelligence in large language models with large-scale human judgments
Tuan Dung Nguyen
Duncan J. Watts
Mark E. Whiting
ELM
38
0
0
15 May 2025
Evaluating LLM Metrics Through Real-World Capabilities
Justin K Miller
Wenjia Tang
ELM
ALM
55
0
0
13 May 2025
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Rushi Qiang
Yuchen Zhuang
Yinghao Li
D. Kilman
Rongzhi Zhang
...
Ian Shu-Hei Wong
Sherry Yang
Percy Liang
Chao Zhang
Bo Dai
ELM
51
0
0
12 May 2025
Measuring General Intelligence with Generated Games
Vivek Verma
David Huang
William Chen
Dan Klein
Nicholas Tomlin
ReLM
ELM
LM&MA
LRM
66
1
0
12 May 2025
Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking
Fabrice Harel-Canada
Boran Erol
Connor Choi
J. Liu
Gary Jiarui Song
Nanyun Peng
Amit Sahai
WaLM
40
0
0
11 May 2025
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban
Hiroaki Hayashi
Yingbo Zhou
Jennifer Neville
65
3
0
09 May 2025
am-ELO: A Stable Framework for Arena-based LLM Evaluation
Zirui Liu
Jiatong Li
Yan Zhuang
Qiang Liu
Shuanghong Shen
Jie Ouyang
Mingyue Cheng
Shijin Wang
76
1
0
06 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
60
0
0
05 May 2025
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo
Jiajun Xu
Yi Zhang
Jiaxi Song
Haoyang Peng
...
Yongming Rao
Houwen Peng
Han Hu
Gordon Wetzstein
Shi-Min Hu
ELM
LRM
62
3
0
04 May 2025
Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs
G. Wang
Zhiwen Chen
Bo Li
Haifeng Xu
285
0
0
02 May 2025
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
D. Sculley
Will Cukierski
Phil Culliton
Sohier Dane
Maggie Demkin
...
Addison Howard
Paul Mooney
Walter Reade
Megan Risdal
Nate Keating
43
1
0
01 May 2025
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang
Ruotian Ma
Qingxuan Jiang
Peisong Wang
Jiaqi Chen
...
Fanghua Ye
Jian Li
Yifan Yang
Zhaopeng Tu
Xiaolong Li
LLMAG
ELM
ALM
164
0
1
01 May 2025
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges
Xiao Xiao
Yu Su
Sijing Zhang
Zhang Chen
Yadong Chen
Tian Liu
47
0
0
30 Apr 2025
ClonEval: An Open Voice Cloning Benchmark
Iwona Christop
Tomasz Kuczyński
Marek Kubis
AuLLM
45
0
0
29 Apr 2025
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Jiageng Wu
Bowen Gu
Ren Zhou
Kevin Xie
Doug Snyder
...
Siyang Song
Jonathan H. Chen
Santiago Romero-Brufau
K. J. Lin
Jie Yang
LM&MA
ELM
103
0
0
28 Apr 2025
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
Sahel Sharifymoghaddam
Shivani Upadhyay
Nandan Thakur
Ronak Pradeep
Jimmy Lin
RALM
40
0
0
28 Apr 2025
Contextual Online Uncertainty-Aware Preference Learning for Human Feedback
Nan Lu
Ethan X. Fang
Junwei Lu
293
0
0
27 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
116
3
0
26 Apr 2025
Scaling Laws For Scalable Oversight
Joshua Engels
David D. Baek
Subhash Kantamneni
Max Tegmark
ELM
79
0
0
25 Apr 2025
A Model Zoo on Phase Transitions in Neural Networks
Konstantin Schurholt
Léo Meynent
Yefan Zhou
Haiquan Lu
Yaoqing Yang
Damian Borth
70
0
0
25 Apr 2025
How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study
Rendi Chevi
Kentaro Inui
Thamar Solorio
Alham Fikri Aji
269
0
0
23 Apr 2025
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Carlo Siebenschuh
Kyle Hippe
Ozan Gokdemir
Alexander Brace
A. Khan
...
V. Vishwanath
R. Stevens
Arvind Ramanathan
Ian Foster
Robert Underwood
MoE
54
0
0
23 Apr 2025
aiXamine: Simplified LLM Safety and Security
Fatih Deniz
Dorde Popovic
Yazan Boshmaf
Euisuh Jeong
M. Ahmad
Sanjay Chawla
Issa M. Khalil
ELM
89
0
0
21 Apr 2025
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa
Zayne Sprague
Chaitanya Malaviya
Philippe Laban
Junyi Jessy Li
Greg Durrett
64
0
0
21 Apr 2025
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Hang Zhang
Jiuchen Shi
Yixiao Wang
Quan Chen
Yizhou Shan
Minyi Guo
49
0
0
19 Apr 2025
Probing and Inducing Combinational Creativity in Vision-Language Models
Yongqian Peng
Yuxi Ma
Mengmeng Wang
Yuxuan Wang
Yizhou Wang
Chuxu Zhang
Yixin Zhu
Zilong Zheng
MLLM
CoGe
89
0
0
17 Apr 2025
A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment
Negar Arabzadeh
Charles L. A. Clarke
51
3
0
16 Apr 2025
Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective
Qiaosi Wang
Xuhui Zhou
Maarten Sap
Jodi Forlizzi
Hong Shen
47
0
0
15 Apr 2025
The Structural Safety Generalization Problem
Julius Broomfield
Tom Gibbs
Ethan Kosak-Hine
George Ingebretsen
Tia Nasir
Jason Zhang
Reihaneh Iranmanesh
Sara Pieri
Reihaneh Rabbany
Kellin Pelrine
AAML
42
0
0
13 Apr 2025
Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models
Vishakh Padmakumar
Chen Yueh-Han
Jane Pan
Valerie Chen
He He
47
0
0
13 Apr 2025
DRAFT-ing Architectural Design Decisions using LLMs
Rudra Dhar
Adyansh Kakran
Amey Karan
Karthik Vaidhyanathan
Vasudeva Varma
46
0
0
11 Apr 2025
Large Language Models Could Be Rote Learners
Yuyang Xu
Renjun Hu
Haochao Ying
Jian Wu
Xing Shi
Wei Lin
ELM
234
0
0
11 Apr 2025
FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion
Longguang Zhong
Fanqi Wan
Ziyi Yang
Guosheng Liang
Tianyuan Shi
Xiaojun Quan
MoMe
66
0
0
09 Apr 2025
Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
Judy Hanwen Shen
Carlos Guestrin
45
0
0
09 Apr 2025
One-Minute Video Generation with Test-Time Training
Karan Dalal
Daniel Koceja
Gashon Hussein
Jiarui Xu
Yue Zhao
...
Tatsunori Hashimoto
Sanmi Koyejo
Yejin Choi
Yu Sun
Xiaolong Wang
ViT
93
6
0
07 Apr 2025
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Weiwei Sun
Shengyu Feng
Shanda Li
Yiming Yang
LLMAG
52
2
0
06 Apr 2025
ArxivBench: Can LLMs Assist Researchers in Conducting Research?
Ning Li
Jingran Zhang
Justin Cui
31
0
0
06 Apr 2025
How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
Yusen Wu
Junwu Xiong
Xiaotie Deng
LLMAG
52
0
0
04 Apr 2025
Evaluating AI Recruitment Sourcing Tools by Human Preference
Vladimir Slaykovskiy
Maksim Zvegintsev
Yury Sakhonchyk
Hrachik Ajamian
46
0
0
03 Apr 2025