Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2505.20451
Cited By
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
26 May 2025
Sahana Ramnath
Anurag Mudgil
Brihi Joshi
Skyler Hallinan
Xiang Ren
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries"
17 / 17 papers shown
Title
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban
Hiroaki Hayashi
Yingbo Zhou
Jennifer Neville
74
10
0
09 May 2025
FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
Anikait Singh
Sheryl Hsu
Kyle Hsu
E. Mitchell
Stefano Ermon
Tatsunori Hashimoto
Archit Sharma
Chelsea Finn
SyDa
OffRL
82
2
0
26 Feb 2025
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
163
96
0
25 Nov 2024
Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown
Xingzhou Lou
Dong Yan
Wei Shen
Yuzi Yan
Jian Xie
Junge Zhang
128
25
0
01 Oct 2024
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Xin Lai
Zhuotao Tian
Yukang Chen
Senqiao Yang
Xiangru Peng
Jiaya Jia
LRM
106
109
0
26 Jun 2024
WildChat: 1M ChatGPT Interaction Logs in the Wild
Wenting Zhao
Xiang Ren
Jack Hessel
Claire Cardie
Yejin Choi
Yuntian Deng
63
208
0
02 May 2024
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Pat Verga
Sebastian Hofstatter
Sophia Althammer
Yixuan Su
Aleksandra Piktus
Arkady Arkhangorodsky
Minjie Xu
Naomi White
Patrick Lewis
ALM
ELM
75
96
0
29 Apr 2024
RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert
Valentina Pyatkin
Jacob Morrison
Lester James V. Miranda
Bill Yuchen Lin
...
Sachin Kumar
Tom Zick
Yejin Choi
Noah A. Smith
Hanna Hajishirzi
ALM
119
250
0
20 Mar 2024
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
...
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
OSLM
94
536
0
07 Mar 2024
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
Zhen Qin
R. Jagerman
Kai Hui
Honglei Zhuang
Junru Wu
...
Tianqi Liu
Jialu Liu
Donald Metzler
Xuanhui Wang
Michael Bendersky
ALM
RALM
69
235
0
30 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
236
4,186
0
09 Jun 2023
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Chen Zhang
L. F. D’Haro
Qiquan Zhang
Thomas Friedrichs
Haizhou Li
51
16
0
25 Oct 2022
Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal
Umang Gupta
Jwala Dhamala
Varun Kumar
Apurv Verma
Yada Pruksachatkun
Satyapriya Krishna
Rahul Gupta
Kai-Wei Chang
Greg Ver Steeg
Aram Galstyan
31
52
0
23 Mar 2022
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
155
5,328
0
07 Jul 2021
A Comprehensive Assessment of Dialog Evaluation Metrics
Yi-Ting Yeh
M. Eskénazi
Shikib Mehri
53
107
0
07 Jun 2021
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
Su Lin Blodgett
Solon Barocas
Hal Daumé
Hanna M. Wallach
92
1,211
0
28 May 2020
Mitigating Gender Bias in Natural Language Processing: Literature Review
Tony Sun
Andrew Gaut
Shirlyn Tang
Yuxin Huang
Mai Elsherief
Jieyu Zhao
Diba Mirza
E. Belding-Royer
Kai-Wei Chang
William Yang Wang
AI4CE
75
548
0
21 Jun 2019
1