Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation

1 September 2024
Jasper Dekoninck
Maximilian Baader
Martin Vechev
    ALM

Papers citing "Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation"

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
Bill Yuchen Lin
Yuntian Deng
Khyathi Chandu
Faeze Brahman
Abhilasha Ravichander
Valentina Pyatkin
Nouha Dziri
Ronan Le Bras
Yejin Choi
113
82
0
07 Jun 2024
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni
Fuzhao Xue
Xiang Yue
Yuntian Deng
Mahir Shah
Kabir Jain
Graham Neubig
Yang You
ELM
87
48
0
03 Jun 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
...
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
ALM, LRM, ELM
110
110
0
01 May 2024
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois
Balázs Galambosi
Percy Liang
Tatsunori Hashimoto
ALM
184
406
0
06 Apr 2024
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang
Lianmin Zheng
Ying Sheng
Anastasios Nikolas Angelopoulos
Tianle Li
...
Hao Zhang
Banghua Zhu
Michael I. Jordan
Joseph E. Gonzalez
Ion Stoica
OSLM
188
608
0
07 Mar 2024
Humans or LLMs as the Judge? A Study on Judgement Biases
Guiming Hardy Chen
Shunian Chen
Ziche Liu
Feng Jiang
Benyou Wang
227
113
0
16 Feb 2024
Evading Data Contamination Detection for Language Models is (too) Easy
Jasper Dekoninck
Mark Niklas Müller
Maximilian Baader
Marc Fischer
Martin Vechev
199
20
0
05 Feb 2024
A Long Way to Go: Investigating Length Correlations in RLHF
Prasann Singhal
Tanya Goyal
Jiacheng Xu
Greg Durrett
163
161
0
05 Oct 2023
Human Feedback is not Gold Standard
Tom Hosking
Phil Blunsom
Max Bartolo
ALM
132
55
0
28 Sep 2023
Don't lose the message while paraphrasing: A study on content preserving style transfer
N. Babakov
David Dale
I. Gusev
I. Krotova
Sergey Petrakov
95
21
0
17 Aug 2023
Style Over Substance: Evaluation Biases for Large Language Models
Minghao Wu
Alham Fikri Aji
ALM, ELM
169
47
0
06 Jul 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM, OSLM, ELM
750
4,483
0
09 Jun 2023
Large Language Models are not Fair Evaluators
Peiyi Wang
Lei Li
Liang Chen
Zefan Cai
Dawei Zhu
Binghuai Lin
Yunbo Cao
Qi Liu
Tianyu Liu
Zhifang Sui
ALM
208
577
0
29 May 2023
GPT-4 Technical Report
OpenAI
Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG, MLLM
1.7K
14,919
0
15 Mar 2023
TweetNLP: Cutting-Edge Natural Language Processing for Social Media
Jose Camacho-Collados
Kiamehr Rezaee
Talayeh Riahi
Asahi Ushio
Daniel Loureiro
...
Eugenio Martínez-Cámara
Gonzalo Medina
T. Buhrmann
Leonardo Neves
Francesco Barbieri
VLM, AI4MH
116
144
0
29 Jun 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM, OffRL, LRM
472
4,630
0
27 Oct 2021
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark
Tal August
Sofia Serrano
Nikita Haduong
Suchin Gururangan
Noah A. Smith
DeLMO
196
419
0
30 Jun 2021
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM, RALM
555
4,610
0
07 Sep 2020
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM, RALM, LRM
315
2,698
0
14 Mar 2018
How I won the "Chess Ratings - Elo vs the Rest of the World" Competition
Yannis Sismanis
99
21
0
21 Dec 2010