arXiv:2310.14424
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
22 October 2023
M. Boubdir
Edward Kim
Beyza Ermis
Marzieh Fadaee
Sara Hooker
ALM
Papers citing "Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation"
23 papers
Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang
Junhong Wu
Chen Wang
Chengqing Zong
J.N. Zhang
ALM
LRM
136
7
0
11 Oct 2024
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
Kehua Feng
Keyan Ding
Hongzhi Tan
Kede Ma
Zhihua Wang
...
Yuzhou Cheng
Ge Sun
Guozhou Zheng
Qiang Zhang
H. Chen
60
12
0
10 Apr 2024
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
302
4,253
0
09 Jun 2023
Koala: An Index for Quantifying Overlaps with Pre-training Corpora
Thuy-Trang Vu
Xuanli He
Gholamreza Haffari
Ehsan Shareghi
CLL
44
14
0
26 Mar 2023
Fine-tuning language models to find agreement among humans with diverse preferences
Michiel A. Bakker
Martin Chadwick
Hannah R. Sheahan
Michael Henry Tessler
Lucy Campbell-Gillingham
...
Nat McAleese
Amelia Glaese
John Aslanides
M. Botvinick
Christopher Summerfield
ALM
92
228
0
28 Nov 2022
A Close Look into the Calibration of Pre-trained Language Models
Yangyi Chen
Lifan Yuan
Ganqu Cui
Zhiyuan Liu
Heng Ji
99
51
0
31 Oct 2022
Out of the BLEU: how should we assess quality of the Code Generation models?
Mikhail Evtikhiev
Egor Bogomolov
Yaroslav Sokolov
T. Bryksin
ALM
58
92
0
05 Aug 2022
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann
Abhik Bhattacharjee
Abinaya Mahendiran
Alex Jinpeng Wang
Alexandros Papangelis
...
Yacine Jernite
Yi Xu
Yisi Sang
Yixin Liu
Yufang Hou
77
38
0
22 Jun 2022
A global analysis of metrics used for measuring performance in natural language processing
Kathrin Blagec
Georg Dorffner
M. Moradi
Simon Ott
Matthias Samwald
61
27
0
25 Apr 2022
Teaching language models to support answers with verified quotes
Jacob Menick
Maja Trebacz
Vladimir Mikulik
John Aslanides
Francis Song
...
Mia Glaese
Susannah Young
Lucy Campbell-Gillingham
G. Irving
Nat McAleese
ELM
RALM
279
264
0
21 Mar 2022
Towards Responsible Natural Language Annotation for the Varieties of Arabic
A. S. Bergman
Mona T. Diab
71
19
0
17 Mar 2022
Dynamic Human Evaluation for Relative Model Comparisons
Thórhildur Thorleiksdóttir
Cédric Renggli
Nora Hollenstein
Ce Zhang
61
2
0
15 Dec 2021
A General Language Assistant as a Laboratory for Alignment
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
...
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Jared Kaplan
ALM
103
773
0
01 Dec 2021
LMdiff: A Visual Diff Tool to Compare Language Models
Hendrik Strobelt
Benjamin Hoover
Arvind Satyanarayan
Sebastian Gehrmann
VLM
43
19
0
02 Nov 2021
Better than Average: Paired Evaluation of NLP Systems
Maxime Peyrard
Wei Zhao
Steffen Eger
Robert West
ELM
55
25
0
20 Oct 2021
BARTScore: Evaluating Generated Text as Text Generation
Weizhe Yuan
Graham Neubig
Pengfei Liu
93
838
0
22 Jun 2021
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge
Maarten Sap
Ana Marasović
William Agnew
Gabriel Ilharco
Dirk Groeneveld
Margaret Mitchell
Matt Gardner
AILaw
77
443
0
18 Apr 2021
Evaluation of Text Generation: A Survey
Asli Celikyilmaz
Elizabeth Clark
Jianfeng Gao
ELM
LM&MA
94
384
0
26 Jun 2020
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
441
1,715
0
18 Sep 2019
A Study of BFLOAT16 for Deep Learning Training
Dhiraj D. Kalamkar
Dheevatsa Mudigere
Naveen Mellempudi
Dipankar Das
K. Banerjee
...
Sudarshan Srinivasan
Abhisek Kundu
M. Smelyanskiy
Bharat Kaul
Pradeep Dubey
MQ
74
343
0
29 May 2019
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor
Jonathan Herzig
Nicholas Lourie
Jonathan Berant
RALM
129
1,714
0
02 Nov 2018
The NarrativeQA Reading Comprehension Challenge
Tomáš Kočiský
Jonathan Richard Schwarz
Phil Blunsom
Chris Dyer
Karl Moritz Hermann
Gábor Melis
Edward Grefenstette
122
768
0
19 Dec 2017
International Standard for a Linguistic Annotation Framework
Laurent Romary
Nancy Ide
151
284
0
22 Jul 2007