Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (arXiv:2403.04132)

7 March 2024
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
Dacheng Li
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
    OSLM

Papers citing "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference"

50 / 340 papers shown
Measuring and Improving Persuasiveness of Large Language Models
Somesh Singh
Yaman Kumar Singla
Harini SI
Balaji Krishnamurthy
45
3
0
03 Oct 2024
EmbedLLM: Learning Compact Representations of Large Language Models
Richard Zhuang
Tianhao Wu
Zhaojin Wen
Andrew Li
Jiantao Jiao
Kannan Ramchandran
AIFin
42
1
0
03 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang He
Yi Zeng
AAML
52
2
0
03 Oct 2024
"Oh LLM, I'm Asking Thee, Please Give Me a Decision Tree": Zero-Shot Decision Tree Induction and Embedding with Large Language Models
"Oh LLM, I'm Asking Thee, Please Give Me a Decision Tree": Zero-Shot Decision Tree Induction and Embedding with Large Language Models
Ricardo Knauer
Mario Koddenbrock
Raphael Wallsberger
Nicholas M. Brisson
Georg N. Duda
Deborah Falla
David W. Evans
Erik Rodner
53
0
0
27 Sep 2024
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Elliot L. Epstein
Kaisheng Yao
Jing Li
Xinyi Bai
Hamid Palangi
LRM
47
0
0
26 Sep 2024
AI Policy Projector: Grounding LLM Policy Design in Iterative Mapmaking
Michelle S. Lam
Fred Hohman
Dominik Moritz
Jeffrey P. Bigham
Kenneth Holstein
Mary Beth Kery
40
1
0
26 Sep 2024
A Scalable Data-Driven Framework for Systematic Analysis of SEC 10-K Filings Using Large Language Models
Syed Affan Daimi
Asma Iqbal
44
1
0
26 Sep 2024
Archon: An Architecture Search Framework for Inference-Time Techniques
Jon Saad-Falcon
Adrian Gamarra Lafuente
Shlok Natarajan
Nahum Maru
Hristo Todorov
...
E. Kelly Buchanan
Mayee Chen
Neel Guha
Christopher Ré
Azalia Mirhoseini
AI4CE
33
20
0
23 Sep 2024
Aligning Language Models Using Follow-up Likelihood as Reward Signal
Chen Zhang
Dading Chong
Feng Jiang
Chengguang Tang
Anningzhe Gao
Guohua Tang
Haizhou Li
ALM
38
2
0
20 Sep 2024
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido
Roser Morante
Julio Gonzalo
Guillermo Marco
Jorge Carrillo-de-Albornoz
...
Enrique Amigó
Andrés Fernández
Alejandro Benito-Santos
Adrián Ghajari Espinosa
Victor Fresno
ELM
58
0
0
19 Sep 2024
Qwen2.5-Coder Technical Report
Binyuan Hui
Jian Yang
Zeyu Cui
Jiaxi Yang
Dayiheng Liu
...
Fei Huang
Xingzhang Ren
Xuancheng Ren
Jingren Zhou
Junyang Lin
OSLM
81
227
0
18 Sep 2024
The Factuality of Large Language Models in the Legal Domain
Rajaa El Hamdani
Thomas Bonald
Fragkiskos D. Malliaros
Nils Holzenberger
Fabian M. Suchanek
AILaw
HILM
51
0
0
18 Sep 2024
From Lists to Emojis: How Format Bias Affects Model Alignment
Xuanchang Zhang
Wei Xiong
Lichang Chen
Dinesh Manocha
Heng Huang
Tong Zhang
ALM
47
11
0
18 Sep 2024
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
Guijin Son
Hyunwoo Ko
Hoyoung Lee
Yewon Kim
Seunghyeok Hong
ALM
ELM
75
8
0
17 Sep 2024
Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
Bishwash Khanal
Jeffery M. Capone
53
1
0
17 Sep 2024
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration
Xin Guan
Nathaniel Demchak
Saloni Gupta
Ze Wang
Ediz Ertekin Jr.
Adriano Soares Koshiyama
Emre Kazim
Zekun Wu
59
3
0
17 Sep 2024
Do Large Language Models Need a Content Delivery Network?
Yihua Cheng
Kuntai Du
Jiayi Yao
Junchen Jiang
KELM
49
5
0
16 Sep 2024
Explaining Datasets in Words: Statistical Models with Natural Language Parameters
Ruiqi Zhong
Heng Wang
Dan Klein
Jacob Steinhardt
44
6
0
13 Sep 2024
Understanding Foundation Models: Are We Back in 1924?
Alan F. Smeaton
AI4CE
53
2
0
11 Sep 2024
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
58
3
0
10 Sep 2024
DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection
Joymallya Chakraborty
Wei Xia
Anirban Majumder
Dan Ma
Walid Chaabene
Naveed Janvekar
21
3
0
09 Sep 2024
Assessing SPARQL capabilities of Large Language Models
Lars-Peter Meyer
Johannes Frey
Felix Brei
Natanael Arndt
48
6
0
09 Sep 2024
Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?
Neeladri Bhuiya
Viktor Schlegel
Stefan Winkler
LRM
45
5
0
08 Sep 2024
Sparse Rewards Can Self-Train Dialogue Agents
B. Lattimer
Varun Gangal
Ryan McDonald
Yi Yang
LLMAG
51
2
0
06 Sep 2024
RAG based Question-Answering for Contextual Response Prediction System
Sriram Veturi
Saurabh Vaichal
Reshma Lal Jagadheesh
Nafis Irtiza Tripto
Nian Yan
RALM
48
5
0
05 Sep 2024
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Bofei Gao
Feifan Song
Yibo Miao
Zefan Cai
Zhiyong Yang
...
Houfeng Wang
Zhifang Sui
Peiyi Wang
Baobao Chang
61
12
0
04 Sep 2024
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Ingo Ziegler
Abdullatif Köksal
Desmond Elliott
Hinrich Schütze
58
5
0
03 Sep 2024
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang
Fuyang Cui
Keiran Paster
Jimmy Ba
Pashootan Vaezipoor
Silviu Pitis
Michael Ruogu Zhang
48
1
0
01 Sep 2024
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck
Maximilian Baader
Martin Vechev
ALM
94
0
0
01 Sep 2024
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Zhikai Li
Xuewen Liu
Dongrong Fu
Jianquan Li
Qingyi Gu
Kurt Keutzer
Zhen Dong
EGVM
VGen
DiffM
95
2
0
26 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
68
25
0
23 Aug 2024
Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
John Mendonça
Isabel Trancoso
A. Lavie
ALM
47
1
0
20 Aug 2024
AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews
Keith Tyser
Ben Segev
Gaston Longhitano
Xin-Yu Zhang
Zachary Meeks
...
Nicholas Belsten
A. Shporer
Madeleine Udell
Dov Te’eni
Iddo Drori
56
15
0
19 Aug 2024
GoNoGo: An Efficient LLM-based Multi-Agent System for Streamlining Automotive Software Release Decision-Making
Arsham Gholamzadeh Khoee
Yinan Yu
R. Feldt
Andris Freimanis
Patrick Andersson
Dhasarathy Parthasarathy
49
2
0
19 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen
Yi Liu
Donghai Hong
Jiaying Chen
Wenhai Wang
49
2
0
18 Aug 2024
Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
Ravi Raju
Swayambhoo Jain
Bo Li
Jonathan Li
Urmish Thakker
ALM
ELM
69
11
0
16 Aug 2024
The Future of Open Human Feedback
Shachar Don-Yehiya
Ben Burtenshaw
Ramon Fernandez Astudillo
Cailean Osborne
Mimansa Jaiswal
...
Omri Abend
Jennifer Ding
Sara Hooker
Hannah Rose Kirk
Leshem Choshen
VLM
ALM
67
4
0
15 Aug 2024
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Karel D'Oosterlinck
Winnie Xu
Chris Develder
Thomas Demeester
A. Singh
Christopher Potts
Douwe Kiela
Shikib Mehri
48
12
0
12 Aug 2024
Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models
Fabio Pernisi
Dirk Hovy
Paul Röttger
56
0
0
08 Aug 2024
MPC-Minimized Secure LLM Inference
Deevashwer Rathee
Dacheng Li
Ion Stoica
Hao Zhang
Raluca A. Popa
52
1
0
07 Aug 2024
Conditioning LLMs with Emotion in Neural Machine Translation
Charles Brazier
Jean-Luc Rouas
CVBM
51
2
0
06 Aug 2024
SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
Muxi Diao
Rumei Li
Shiyang Liu
Guogang Liao
Jingang Wang
Xunliang Cai
Weiran Xu
AAML
59
1
0
05 Aug 2024
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Kunlun Zhu
Yifan Luo
Dingling Xu
Ruobing Wang
Shi Yu
...
Yishan Li
Zhiyuan Liu
Xu Han
Zhiyuan Liu
Maosong Sun
58
18
0
02 Aug 2024
How to Measure the Intelligence of Large Language Models?
Nils Korber
Silvan Wehrli
Christopher Irrgang
ELM
ALM
59
0
0
30 Jul 2024
Evaluating Large Language Models for automatic analysis of teacher simulations
David de-Fitero-Dominguez
Mariano Albaladejo-González
Antonio Garcia-Cabot
Eva García-López
Antonio Moreno-Cediel
Erin Barno
Justin Reich
ELM
33
0
0
29 Jul 2024
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Juhwan Choi
Junehyoung Kwon
Jungmin Yun
Seunguk Yu
Youngbin Kim
53
1
0
29 Jul 2024
Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon
Ari Holtzman
Peter West
William Y. Wang
Naomi Saphra
53
10
0
22 Jul 2024
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Rujun Han
Yuhao Zhang
Peng Qi
Yumo Xu
Jenyuan Wang
Lan Liu
William Yang Wang
Bonan Min
Vittorio Castelli
RALM
45
19
0
19 Jul 2024
LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation
David Schlangen
53
1
0
18 Jul 2024
Baba Is AI: Break the Rules to Beat the Benchmark
Nathan Cloos
Meagan Jens
Michelangelo Naim
Yen-Ling Kuo
Ignacio Cases
Andrei Barbu
Christopher J. Cueva
VLM
36
1
0
18 Jul 2024