Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
arXiv 2403.04132 · 7 March 2024
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, Ion Stoica
Tags: OSLM

Papers citing "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference"
50 of 340 papers shown

AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
Kristen M. Edwards, Farnaz Tehranchi, Scarlett R. Miller, Faez Ahmed
01 Apr 2025

Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi, Alireza Hashemi, Majid Daliri, Pegah Mohammadipour, Alireza Farhadi, Samira Malek, Yekta Yazdanifard, Amir Khasahmadi, V. Honavar
Tags: ELM, LRM · 01 Apr 2025

Learning a Canonical Basis of Human Preferences from Binary Ratings
Kailas Vodrahalli, Wei Wei, James Zou
31 Mar 2025

On Large Multimodal Models as Open-World Image Classifiers
Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci
Tags: VLM · 27 Mar 2025 (presented at ResearchTrend Connect | VLM on 07 May 2025)

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Yong Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuezun Li, Shixiang Tang, Wanli Ouyang, Min Zhang, Dongzhan Zhou
27 Mar 2025

Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead
Viktor Schlegel, Anil A Bharath, Zilong Zhao, Kevin Yee
26 Mar 2025

A multi-agentic framework for real-time, autonomous freeform metasurface design
Robert Lupoiu, Yixuan Shao, Tianxiang Dai, Chenkai Mao, Kofi Edee, Jonathan A. Fan
Tags: AI4CE · 26 Mar 2025

SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu
Tags: ALM · 24 Mar 2025

ChatBench: From Static Benchmarks to Human-AI Evaluation
Serina Chang, Ashton Anderson, Jake M. Hofman
Tags: ELM, AI4MH · 22 Mar 2025

How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
Aly M. Kassem, Bernhard Schölkopf, Zhijing Jin
20 Mar 2025

Aligning Text-to-Music Evaluation with Human Preferences
Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue
Tags: EGVM · 20 Mar 2025

Navigating Rifts in Human-LLM Grounding: Study and Benchmark
Omar Shaikh, Hussein Mozannar, Gagan Bansal, Adam Fourney, Eric Horvitz
18 Mar 2025

MetaScale: Test-Time Scaling with Evolving Meta-Thoughts
Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Mengzhao Chen
Tags: LLMAG, ReLM, AI4Cl, LRM · 17 Mar 2025

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
17 Mar 2025

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, Jiajun Chen
Tags: MLLM, VLM · 16 Mar 2025

OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Ivan Kartáč, Mateusz Lango, Ondrej Dusek
Tags: ELM · 14 Mar 2025

Large language model-powered AI systems achieve self-replication with no human intervention
Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, Min Yang
Tags: GNN, LRM · 14 Mar 2025

Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark
Viktor Moskvoretskii, Alina Lobanova, Ekaterina Neminova, Chris Biemann, Alexander Panchenko, Irina Nikishina
13 Mar 2025

Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
Luke M. Guerdan, Solon Barocas, Kenneth Holstein, Hanna M. Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
Tags: ALM, ELM · 13 Mar 2025

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Kai Zou
Tags: MoE · 12 Mar 2025

EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
Tags: ELM · 11 Mar 2025

ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, ..., Yosuke Kashiwagi, E. Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe
11 Mar 2025

Enhancing Large Language Models for Hardware Verification: A Novel SystemVerilog Assertion Dataset
Anand Menon, Samit S Miftah, Shamik Kundu, Souvik Kundu, Amisha Srivastava, Arnab Raha, Gabriel Theodor Sonnenschein, Suvadeep Banerjee, Deepak A. Mathaikutty, K. Basu
11 Mar 2025

ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, B. Yener
Tags: LRM · 10 Mar 2025

WildIFEval: Instruction Following in the Wild
Gili Lior, Asaf Yehudai, Ariel Gera, L. Ein-Dor
09 Mar 2025

Toward an Evaluation Science for Generative AI Systems
Laura Weidinger, Deb Raji, Hanna M. Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Sayash Kapoor, Deep Ganguli, Sanmi Koyejo
Tags: EGVM, ELM · 07 Mar 2025

RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma
Tags: ALM, ELM · 07 Mar 2025

S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
Feng Jiang, Zhiyu Lin, Fan Bu, Yuhao Du, Benyou Wang, Haoyang Li
Tags: AuLLM, ELM · 07 Mar 2025

SANDWiCH: Semantical Analysis of Neighbours for Disambiguating Words in Context ad Hoc
Daniel Guzman-Olivares, Lara Quijano-Sanchez, Federico Liberatore
07 Mar 2025

Shifting Long-Context LLMs Research from Input to Output
Yuhao Wu, Yushi Bai, Zhiqing Hu, Shangqing Tu, Ming Shan Hee, Juanzi Li, Roy Ka-wei Lee
06 Mar 2025

English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance
Karl Audun Borgersen
05 Mar 2025

Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation
Hongzhi Luan, Changxin Tian, Zhaoxin Huan, Xiaolu Zhang, Kunlong Chen, Qing Cui, Zhiqiang Zhang
02 Mar 2025

Evaluating Polish linguistic and cultural competency in large language models
Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz, Rafał Poświata
Tags: ELM · 02 Mar 2025

CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments
Mingcong Lei, Ge Wang, Yiming Zhao, Zhixin Mai, Qing Zhao, Yao Guo, Zhen Li, Shuguang Cui, Yatong Han, J. Ren
Tags: LLMAG · 02 Mar 2025

Towards Refining Developer Questions using LLM-Based Named Entity Recognition for Developer Chatroom Conversations
Pouya Fathollahzadeh, Mariam El Mezouar, Hao Li, Ying Zou, Ahmed E. Hassan
02 Mar 2025

BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
Terry Tong, Fei Wang, Zhe Zhao, Mengzhao Chen
Tags: AAML, ELM · 01 Mar 2025

WorldModelBench: Judging Video Generation Models As World Models
Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, ..., Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Enze Xie, Yaojie Lu
Tags: VGen · 28 Feb 2025

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Tags: LRM · 26 Feb 2025

Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Kuang Wang, Xianrui Li, Steve Yang, Li Zhou, Feng Jiang, Haoyang Li
26 Feb 2025

Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Yancheng He, Shilong Li, Jing Liu, Weixun Wang, Xingyuan Bu, ..., Zhongyuan Peng, Zhenru Zhang, Zhicheng Zheng, Wenbo Su, Bo Zheng
Tags: ELM, LRM · 26 Feb 2025

ATEB: Evaluating and Improving Advanced NLP Tasks for Text Embedding Models
Simeng Han, Frank Palma Gomez, Tu Vu, Zefei Li, Daniel Cer, Hansi Zeng, Chris Tar, Arman Cohan, Gustavo Hernández Ábrego
24 Feb 2025

Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews
Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong
Tags: LRM · 24 Feb 2025

OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering
Jiahao Nick Li, Zhuohao Jerry Zhang, Zhang
24 Feb 2025

IPO: Your Language Model is Secretly a Preference Classifier
Shivank Garg, Ayush Singh, Shweta Singh, Paras Chopra
22 Feb 2025

Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models
Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim
20 Feb 2025

Optimizing Model Selection for Compound AI Systems
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei A. Zaharia, James Zou, Ion Stoica
20 Feb 2025

Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models
Rubing Li, João Sedoc, Arun Sundararajan
Tags: LRM · 20 Feb 2025

None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
Tags: ELM · 18 Feb 2025

What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Marco Antonio Stranisci, Christian Hardmeier
17 Feb 2025

TinyEmo: Scaling down Emotional Reasoning via Metric Projection
Cristian Gutierrez
Tags: LRM · 17 Feb 2025