2110.10746
Cited By
Better than Average: Paired Evaluation of NLP Systems
20 October 2021
Maxime Peyrard
Wei-Ye Zhao
Steffen Eger
Robert West
ELM
Papers citing
"Better than Average: Paired Evaluation of NLP Systems"
20 / 20 papers shown
Title
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
100
3
0
12 Dec 2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen
Hannes Groner
Sina Zarrieß
Steffen Eger
42
8
0
21 Jun 2024
Stronger Random Baselines for In-Context Learning
Gregory Yauney
David M. Mimno
47
2
0
19 Apr 2024
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
M. Boubdir
Edward Kim
Beyza Ermis
Marzieh Fadaee
Sara Hooker
ALM
33
18
0
22 Oct 2023
Efficient Benchmarking of Language Models
Yotam Perlitz
Elron Bandel
Ariel Gera
Ofir Arviv
L. Ein-Dor
Eyal Shnarch
Noam Slonim
Michal Shmueli-Scheuer
Leshem Choshen
ALM
24
24
0
22 Aug 2023
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Ye Hu
Kaiqiang Song
Sangwoo Cho
Xiaoyang Wang
H. Foroosh
Fei Liu
31
11
0
24 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stéphan Clémençon
Pierre Colombo
34
5
0
17 May 2023
Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak
Marian Simko
21
3
0
03 Jan 2023
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo
Maxime Peyrard
Nathan Noiry
Robert West
Pablo Piantanida
49
11
0
31 Aug 2022
Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son
Jiho Jin
Haneul Yoo
Jinyeong Bak
Kyunghyun Cho
Alice H. Oh
35
4
0
20 May 2022
Descartes: Generating Short Descriptions of Wikipedia Articles
Marija Sakota
Maxime Peyrard
Robert West
VLM
20
2
0
20 May 2022
Exact Paired-Permutation Testing for Structured Test Statistics
Ran Zmigrod
Tim Vieira
Ryan Cotterell
14
5
0
03 May 2022
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter
Piyawat Lertvittayakumjorn
M. Fomicheva
Wei-Ye Zhao
Yang Gao
Steffen Eger
AAML
ELM
30
20
0
21 Mar 2022
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri
Jinho Choi
L. F. D’Haro
Jan Deriu
M. Eskénazi
...
David Traum
Yi-Ting Yeh
Zhou Yu
Yizhe Zhang
Chen Zhang
30
21
0
18 Mar 2022
What are the best systems? New perspectives on NLP Benchmarking
Pierre Colombo
Nathan Noiry
Ekhine Irurozki
Stéphan Clémençon
27
28
0
08 Feb 2022
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Wei-Ye Zhao
Michael Strube
Steffen Eger
27
37
0
26 Jan 2022
Invariant Language Modeling
Maxime Peyrard
Sarvjeet Ghotra
Martin Josifoski
Vidhan Agarwal
Barun Patra
Dean Carignan
Emre Kıcıman
Robert West
29
13
0
16 Oct 2021
The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
M. Fomicheva
Piyawat Lertvittayakumjorn
Wei-Ye Zhao
Steffen Eger
Yang Gao
ELM
24
39
0
08 Oct 2021
The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam
Steve Yadlowsky
Jason W. Wei
Naomi Saphra
Alexander D'Amour
...
Iulia Turc
Jacob Eisenstein
Dipanjan Das
Ian Tenney
Ellie Pavlick
24
93
0
30 Jun 2021
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomáš Kočiský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
196
3,513
0
10 Jun 2015