Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference


7 March 2024
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
Dacheng Li
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
    OSLM

Papers citing "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference"

50 / 340 papers shown
Understanding Reference Policies in Direct Preference Optimization
Yixin Liu
Pengfei Liu
Arman Cohan
49
7
0
18 Jul 2024
Combining Constraint Programming Reasoning with Large Language Model Predictions
Florian Régin
Elisabetta De Maria
Alexandre Bonlarron
74
3
0
18 Jul 2024
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang
Bo Li
Peiyuan Zhang
Fanyi Pu
Joshua Adrian Cahyono
...
Shuai Liu
Yuanhan Zhang
Jingkang Yang
Chunyuan Li
Ziwei Liu
101
79
0
17 Jul 2024
How Are LLMs Mitigating Stereotyping Harms? Learning from Search Engine Studies
Alina Leidinger
Richard Rogers
46
5
0
16 Jul 2024
AstroMLab 1: Who Wins Astronomy Jeopardy!?
Yuan-Sen Ting
Tuan Dung Nguyen
Tirthankar Ghosal
Rui Pan
Hardik Arora
...
Tijmen de Haan
Nesar Ramachandra
Azton Wells
Sandeep Madireddy
Alberto Accomazzi
OOD
26
4
0
15 Jul 2024
Qwen2 Technical Report
An Yang
Baosong Yang
Binyuan Hui
Jian Xu
Bowen Yu
...
Yuqiong Liu
Zeyu Cui
Zhenru Zhang
Zhifang Guo
Zhi-Wei Fan
OSLM
VLM
MU
70
833
0
15 Jul 2024
Virtual Personas for Language Models via an Anthology of Backstories
Suhong Moon
Marwa Abdulhai
Minwoo Kang
Joseph Suh
Widyadewi Soedarmadji
Eran Kohen Behar
David M. Chan
49
12
0
09 Jul 2024
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Yue Liu
LRM
67
13
0
08 Jul 2024
On Speeding Up Language Model Evaluation
Jin Peng Zhou
Christian K. Belardi
Ruihan Wu
Travis Zhang
Carla P. Gomes
Wen Sun
Kilian Q. Weinberger
63
1
0
08 Jul 2024
How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Bojana Bašaragin
Adela Ljajić
Darija Medvecki
Lorenzo Cassano
Milos Kosprdic
Nikola Milosevic
LM&MA
45
3
0
06 Jul 2024
Generalists vs. Specialists: Evaluating Large Language Models for Urdu
Samee Arif
Abdul Hameed Azeemi
Agha Ali Raza
Awais Athar
ALM
LM&MA
ELM
63
4
0
05 Jul 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Zhimin Zhao
A. A. Bangash
F. Côgo
Bram Adams
Ahmed E. Hassan
76
1
0
04 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
55
10
0
03 Jul 2024
Learning to Refine with Fine-Grained Natural Language Feedback
Manya Wadhwa
Xinyu Zhao
Junyi Jessy Li
Greg Durrett
42
12
0
02 Jul 2024
GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning
Hasna Chouikhi
Manel Aloui
Cyrine Ben Hammou
Ghaith Chaabane
Haithem Kchaou
Chehir Dhaouadi
49
0
0
02 Jul 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh
Tejas Srinivasan
Swabha Swayamdipta
56
2
0
02 Jul 2024
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Jiayi Yuan
Hongyi Liu
Shaochen Zhong
Yu-Neng Chuang
...
Hongye Jin
Vipin Chaudhary
Zhaozhuo Xu
Zirui Liu
Xia Hu
53
18
0
01 Jul 2024
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz
Satak Kumar Dey
Ruwad Naswan
Hasnaen Adil
Khondker Salman Sayeed
Haz Sameen Shahgir
49
0
0
29 Jun 2024
GraphArena: Evaluating and Exploring Large Language Models on Graph Computation
Jianheng Tang
Qifan Zhang
Yuhan Li
Nuo Chen
Jia Li
52
3
0
29 Jun 2024
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White
Samuel Dooley
Manley Roberts
Arka Pal
Ben Feuer
...
Tom Goldstein
Willie Neiswanger
Micah Goldblum
ELM
55
13
0
27 Jun 2024
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
A. Bavaresco
Raffaella Bernardi
Leonardo Bertolazzi
Desmond Elliott
Raquel Fernández
...
David Schlangen
Alessandro Suglia
Aditya K Surikuchi
Ece Takmaz
A. Testoni
ALM
ELM
66
64
0
26 Jun 2024
Methodology of Adapting Large English Language Models for Specific Cultural Contexts
Wenjing Zhang
Siqi Xiao
Xuejiao Lei
Rongjia Du
Huazheng Zhang
Meijuan An
Bikun Yang
Zhaoxiang Liu
Kai Wang
Shiguo Lian
ALM
29
0
0
26 Jun 2024
Towards LLM-Powered Ambient Sensor Based Multi-Person Human Activity Recognition
Xi Chen
Julien Cumin
F. Ramparany
Dominique Vaufreydaz
55
1
0
25 Jun 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DV
MLLM
62
300
0
24 Jun 2024
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
Ronak Pradeep
Nandan Thakur
Sahel Sharifymoghaddam
Eric Zhang
Ryan Nguyen
Daniel Campos
Nick Craswell
Jimmy Lin
61
13
0
24 Jun 2024
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Jiale Cheng
Yida Lu
Xiaotao Gu
Pei Ke
Xiao-Yang Liu
Yuxiao Dong
Hongning Wang
Jie Tang
Minlie Huang
55
4
0
24 Jun 2024
Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models?
Yuu Jinnai
72
1
0
24 Jun 2024
AudioBench: A Universal Benchmark for Audio Large Language Models
Bin Wang
Xunlong Zou
Geyu Lin
Siyang Song
Zhuohan Liu
Wenyu Zhang
Zhengyuan Liu
AiTi Aw
Nancy F. Chen
AuLLM
ELM
LM&MA
92
23
0
23 Jun 2024
PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
56
9
0
21 Jun 2024
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie
Xiangyu Qi
Yi Zeng
Yangsibo Huang
Udari Madhushani Sehwag
...
Bo Li
Kai Li
Danqi Chen
Peter Henderson
Prateek Mittal
ALM
ELM
58
61
0
20 Jun 2024
Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks
Dan S. Nielsen
Kenneth Enevoldsen
Peter Schneider-Kamp
ELM
55
4
0
19 Jun 2024
DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversation Systems
J. Kim
Woosog Chay
Hyeonji Hwang
Daeun Kyung
Hyunseung Chung
Eunbyeol Cho
Yohan Jo
Edward Choi
LLMAG
47
1
0
19 Jun 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur
Kartik Choudhary
Venkat Srinik Ramayapally
Sankaran Vaidyanathan
Dieuwke Hupkes
ELM
ALM
70
57
0
18 Jun 2024
ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark
Hiromi Wakaki
Yuki Mitsufuji
Yoshinori Maeda
Yukiko Nishimura
Silin Gao
Mengjie Zhao
Keiichi Yamada
Antoine Bosselut
64
0
0
17 Jun 2024
Evaluating the Performance of Large Language Models via Debates
Behrad Moniri
Hamed Hassani
Yan Sun
ELM
ALM
66
5
0
16 Jun 2024
What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li
Haoqin Tu
Mude Hui
Zeyu Wang
Bingchen Zhao
...
Jieru Mei
Qing Liu
Huangjie Zheng
Yuyin Zhou
Cihang Xie
VLM
MLLM
49
36
0
12 Jun 2024
Designing a Dashboard for Transparency and Control of Conversational AI
Yida Chen
Aoyu Wu
Trevor DePodesta
Catherine Yeh
Kenneth Li
...
Jan Riecke
Shivam Raval
Olivia Seow
Martin Wattenberg
Fernanda Viégas
63
17
0
12 Jun 2024
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks
Justin Zhao
Flor Miriam Plaza del Arco
Amanda Cercas Curry
ELM
ALM
60
1
0
12 Jun 2024
Annotation alignment: Comparing LLM and human annotations of conversational safety
Rajiv Movva
Pang Wei Koh
Emma Pierson
ALM
57
3
0
10 Jun 2024
GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Anthony Costarelli
Mat Allen
Roman Hauksson
Grace Sodunke
Suhas Hariharan
Carlson Cheng
Wenjie Li
Joshua Clymer
Arjun Yadav
ELM
ReLM
LLMAG
LRM
54
20
0
07 Jun 2024
GenAI Arena: An Open Evaluation Platform for Generative Models
Dongfu Jiang
Max Ku
Tianle Li
Yuansheng Ni
Shizhuo Sun
Rongqi Fan
Wenhu Chen
EGVM
46
20
0
06 Jun 2024
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation
Adam Fisch
Joshua Maynez
R. A. Hofer
Bhuwan Dhingra
Amir Globerson
William W. Cohen
57
8
0
06 Jun 2024
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
43
42
0
06 Jun 2024
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang
Xueguang Ma
Ge Zhang
Yuansheng Ni
Abhranil Chandra
...
Kai Wang
Alex Zhuang
Rongqi Fan
Xiang Yue
Wenhu Chen
LRM
ELM
66
343
0
03 Jun 2024
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni
Fuzhao Xue
Xiang Yue
Yuntian Deng
Mahir Shah
Kabir Jain
Graham Neubig
Yang You
ELM
32
40
0
03 Jun 2024
Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost
Masha Belyi
Robert Friel
Shuai Shao
Atindriyo Sanyal
HILM
RALM
72
6
0
03 Jun 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
63
12
0
02 Jun 2024
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Tengyang Xie
Dylan J. Foster
Akshay Krishnamurthy
Corby Rosset
Ahmed Hassan Awadallah
Alexander Rakhlin
56
36
0
31 May 2024
clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents
Anne Beyer
Kranti Chalamalasetti
Sherzod Hakimov
Brielen Madureira
P. Sadler
David Schlangen
LLMAG
62
4
0
31 May 2024
Provably Efficient Interactive-Grounded Learning with Personalized Reward
Mengxiao Zhang
Yuheng Zhang
Haipeng Luo
Paul Mineiro
39
0
0
31 May 2024