Better than Average: Paired Evaluation of NLP Systems

20 October 2021
Maxime Peyrard
Wei-Ye Zhao
Steffen Eger
Robert West
    ELM

Papers citing "Better than Average: Paired Evaluation of NLP Systems"

20 / 20 papers shown
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
100
3
0
12 Dec 2024
Evaluating Diversity in Automatic Poetry Generation
Yanran Chen
Hannes Groner
Sina Zarrieß
Steffen Eger
42
8
0
21 Jun 2024
Stronger Random Baselines for In-Context Learning
Gregory Yauney
David M. Mimno
47
2
0
19 Apr 2024
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
M. Boubdir
Edward Kim
Beyza Ermis
Marzieh Fadaee
Sara Hooker
ALM
33
18
0
22 Oct 2023
Efficient Benchmarking of Language Models
Yotam Perlitz
Elron Bandel
Ariel Gera
Ofir Arviv
L. Ein-Dor
Eyal Shnarch
Noam Slonim
Michal Shmueli-Scheuer
Leshem Choshen
ALM
24
24
0
22 Aug 2023
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Ye Hu
Kaiqiang Song
Sangwoo Cho
Xiaoyang Wang
H. Foroosh
Fei Liu
31
11
0
24 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stéphan Clémençon
Pierre Colombo
34
5
0
17 May 2023
Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak
Marian Simko
21
3
0
03 Jan 2023
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo
Maxime Peyrard
Nathan Noiry
Robert West
Pablo Piantanida
49
11
0
31 Aug 2022
Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son
Jiho Jin
Haneul Yoo
Jinyeong Bak
Kyunghyun Cho
Alice H. Oh
35
4
0
20 May 2022
Descartes: Generating Short Descriptions of Wikipedia Articles
Marija Sakota
Maxime Peyrard
Robert West
VLM
20
2
0
20 May 2022
Exact Paired-Permutation Testing for Structured Test Statistics
Ran Zmigrod
Tim Vieira
Ryan Cotterell
14
5
0
03 May 2022
Towards Explainable Evaluation Metrics for Natural Language Generation
Christoph Leiter
Piyawat Lertvittayakumjorn
M. Fomicheva
Wei-Ye Zhao
Yang Gao
Steffen Eger
AAML
ELM
30
20
0
21 Mar 2022
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Shikib Mehri
Jinho Choi
L. F. D’Haro
Jan Deriu
M. Eskénazi
...
David Traum
Yi-Ting Yeh
Zhou Yu
Yizhe Zhang
Chen Zhang
30
21
0
18 Mar 2022
What are the best systems? New perspectives on NLP Benchmarking
Pierre Colombo
Nathan Noiry
Ekhine Irurozki
Stéphan Clémençon
27
28
0
08 Feb 2022
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Wei-Ye Zhao
Michael Strube
Steffen Eger
27
37
0
26 Jan 2022
Invariant Language Modeling
Maxime Peyrard
Sarvjeet Ghotra
Martin Josifoski
Vidhan Agarwal
Barun Patra
Dean Carignan
Emre Kıcıman
Robert West
29
13
0
16 Oct 2021
The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
M. Fomicheva
Piyawat Lertvittayakumjorn
Wei-Ye Zhao
Steffen Eger
Yang Gao
ELM
24
39
0
08 Oct 2021
The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam
Steve Yadlowsky
Jason W. Wei
Naomi Saphra
Alexander D'Amour
...
Iulia Turc
Jacob Eisenstein
Dipanjan Das
Ian Tenney
Ellie Pavlick
24
93
0
30 Jun 2021
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomáš Kočiský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
196
3,513
0
10 Jun 2015