2009.13888
Cited By
Utility is in the Eye of the User: A Critique of NLP Leaderboards
29 September 2020
Kawin Ethayarajh
Dan Jurafsky
ELM
Papers citing
"Utility is in the Eye of the User: A Critique of NLP Leaderboards"
34 / 34 papers shown
Title
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Guanhua Zhang
Moritz Hardt
42
7
0
02 May 2024
A Roadmap to Pluralistic Alignment
Taylor Sorensen
Jared Moore
Jillian R. Fisher
Mitchell L. Gordon
Niloofar Mireshghallah
...
Liwei Jiang
Ximing Lu
Nouha Dziri
Tim Althoff
Yejin Choi
65
80
0
07 Feb 2024
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
Momchil Hardalov
Pepa Atanasova
Todor Mihaylov
G. Angelova
K. Simov
P. Osenova
Ves Stoyanov
Ivan Koychev
Preslav Nakov
Dragomir R. Radev
ELM
FedML
21
4
0
04 Jun 2023
It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance
Arjun Subramonian
Xingdi Yuan
Hal Daumé
Su Lin Blodgett
37
17
0
15 May 2023
This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish
Łukasz Augustyniak
Kamil Tagowski
Albert Sawczyn
Denis Janiak
Roman Bartusiak
...
Arkadiusz Janz
Piotr Szymański
M. Morzy
Tomasz Kajdanowicz
Maciej Piasecki
18
10
0
23 Nov 2022
An Interdisciplinary Perspective on Evaluation and Experimental Design for Visual Text Analytics: Position Paper
Kostiantyn Kucher
N. Sultanum
Angel Daza
Vasiliki Simaki
Maria Skeppstedt
Barbara Plank
Jean-Daniel Fekete
Narges Mahyar
8
4
0
23 Sep 2022
Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold
Sebastian Ruder
Ivan Vulić
Anders Søgaard
33
29
0
20 Jun 2022
Evaluation Gaps in Machine Learning Practice
Ben Hutchinson
Negar Rostamzadeh
Christina Greer
Katherine A. Heller
Vinodkumar Prabhakaran
ELM
28
56
0
11 May 2022
Richer Countries and Richer Representations
Kaitlyn Zhou
Kawin Ethayarajh
Dan Jurafsky
38
9
0
10 May 2022
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
Kaitlyn Zhou
Kawin Ethayarajh
Dallas Card
Dan Jurafsky
31
66
0
10 May 2022
Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages
C. Forbes
Farhan Samir
Bruce Oliver
Changbing Yang
Edith Coates
Garrett Nicolai
Miikka Silfverberg
9
1
0
17 Mar 2022
Mukayese: Turkish NLP Strikes Back
Ali Safaya
Emirhan Kurtuluş
Arda Göktoğan
Deniz Yuret
28
22
0
02 Mar 2022
Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis
A. Perevalov
Xiongliang Yan
Liubov Kovriguina
Longquan Jiang
A. Both
Ricardo Usbeck
ELM
20
19
0
20 Jan 2022
How not to Lie with a Benchmark: Rearranging NLP Leaderboards
Tatiana Shavrina
Valentin Malykh
ALM
ELM
418
10
0
02 Dec 2021
AI and the Everything in the Whole Wide World Benchmark
Inioluwa Deborah Raji
Emily M. Bender
Amandalynne Paullada
Emily L. Denton
A. Hanna
30
291
0
26 Nov 2021
Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking
Dirk Vath
Pascal Tilli
Ngoc Thang Vu
33
4
0
11 Oct 2021
Expected Validation Performance and Estimation of a Random Variable's Maximum
Jesse Dodge
Suchin Gururangan
Dallas Card
Roy Schwartz
Noah A. Smith
46
9
0
01 Oct 2021
Survey of Low-Resource Machine Translation
Barry Haddow
Rachel Bawden
Antonio Valerio Miceli Barone
Jindřich Helcl
Alexandra Birch
AIMat
31
147
0
01 Sep 2021
Challenges for cognitive decoding using deep learning methods
A. Thomas
Christopher Ré
R. Poldrack
AI4CE
16
6
0
16 Aug 2021
Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
Alexander Miserlis Hoyle
Pranav Goel
Denis Peskov
Andrew Hian-Cheong
Jordan L. Boyd-Graber
Philip Resnik
35
127
0
05 Jul 2021
The Values Encoded in Machine Learning Research
Abeba Birhane
Pratyusha Kalluri
Dallas Card
William Agnew
Ravit Dotan
Michelle Bao
25
274
0
29 Jun 2021
A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation
Sebastin Santy
Prasanta Bhattacharya
LLMAG
33
2
0
11 Jun 2021
Towards transparency in NLP shared tasks
Carla Parra Escartín
Teresa Lynn
Joss Moorkens
Jane Dunne
18
4
0
11 May 2021
Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks
Tatiana Iazykova
Denis Kapelyushnik
Olga Bystrova
Andrey Kutuzov
ELM
8
1
0
03 May 2021
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
Sebastian Ruder
Noah Constant
Jan A. Botha
Aditya Siddhant
Orhan Firat
...
Pengfei Liu
Junjie Hu
Dan Garrette
Graham Neubig
Melvin Johnson
ELM
AAML
LRM
13
184
0
15 Apr 2021
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela
Max Bartolo
Yixin Nie
Divyansh Kaushik
Atticus Geiger
...
Pontus Stenetorp
Robin Jia
Mohit Bansal
Christopher Potts
Adina Williams
24
387
0
07 Apr 2021
What Will it Take to Fix Benchmarking in Natural Language Understanding?
Samuel R. Bowman
George E. Dahl
ELM
ALM
28
156
0
05 Apr 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
248
285
0
02 Feb 2021
Fairness in Machine Learning
L. Oneto
Silvia Chiappa
FaML
256
488
0
31 Dec 2020
What Can We Do to Improve Peer Review in NLP?
Anna Rogers
Isabelle Augenstein
19
49
0
08 Oct 2020
The Computational Limits of Deep Learning
Neil C. Thompson
Kristjan Greenewald
Keeheon Lee
Gabriel F. Manso
VLM
15
505
0
10 Jul 2020
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Tal Linzen
220
188
0
03 May 2020
Certified Robustness to Adversarial Word Substitutions
Robin Jia
Aditi Raghunathan
Kerem Göksel
Percy Liang
AAML
183
290
0
03 Sep 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,956
0
20 Apr 2018