Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2106.06052
Cited By
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
21 May 2021
Zhiyi Ma
Kawin Ethayarajh
Tristan Thrush
Somya Jain
Ledell Yu Wu
Robin Jia
Christopher Potts
Adina Williams
Douwe Kiela
ELM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking"
50 / 61 papers shown
Title
CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models
Shengzhuang Chen
Yikai Liao
Xiaoxiao Sun
Kede Ma
Ying Wei
117
0
0
06 Mar 2025
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Ziang Xiao
Shu Wang
Xing Xie
ELM
ALM
132
8
0
20 Jun 2024
Investigating Failures of Automatic Translation in the Case of Unambiguous Gender
Adithya Renduchintala
Adina Williams
67
27
0
16 Apr 2021
ExplainaBoard: An Explainable Leaderboard for NLP
Pengfei Liu
Jinlan Fu
Yanghua Xiao
Weizhe Yuan
Shuaichen Chang
Junqi Dai
Yixin Liu
Zihuiwen Ye
Zi-Yi Dou
Graham Neubig
XAI
LRM
ELM
69
55
0
13 Apr 2021
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela
Max Bartolo
Yixin Nie
Divyansh Kaushik
Atticus Geiger
...
Pontus Stenetorp
Robin Jia
Joey Tianyi Zhou
Christopher Potts
Adina Williams
208
409
0
07 Apr 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
311
285
0
02 Feb 2021
Robustness Gym: Unifying the NLP Evaluation Landscape
Karan Goel
Nazneen Rajani
Jesse Vig
Samson Tan
Jason M. Wu
Stephan Zheng
Caiming Xiong
Joey Tianyi Zhou
Christopher Ré
AAML
OffRL
OOD
190
140
0
13 Jan 2021
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
Bertie Vidgen
Tristan Thrush
Zeerak Talat
Douwe Kiela
130
271
0
31 Dec 2020
HateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger
B. Vidgen
Dong Nguyen
Zeerak Talat
Helen Z. Margetts
J. Pierrehumbert
99
274
0
31 Dec 2020
DynaSent: A Dynamic Benchmark for Sentiment Analysis
Christopher Potts
Zhengxuan Wu
Atticus Geiger
Douwe Kiela
263
80
0
30 Dec 2020
Data and its (dis)contents: A survey of dataset development and use in machine learning research
Amandalynne Paullada
Inioluwa Deborah Raji
Emily M. Bender
Emily L. Denton
A. Hanna
121
525
0
09 Dec 2020
With Little Power Comes Great Responsibility
Dallas Card
Peter Henderson
Urvashi Khandelwal
Robin Jia
Kyle Mahowald
Dan Jurafsky
260
118
0
13 Oct 2020
Astraea: Grammar-based Fairness Testing
E. Soremekun
Sakshi Udeshi
Sudipta Chattopadhyay
113
30
0
06 Oct 2020
Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
Patrick Lewis
Pontus Stenetorp
Sebastian Riedel
OOD
ELM
155
187
0
06 Aug 2020
Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
Nitika Mathur
Tim Baldwin
Trevor Cohn
56
247
0
11 Jun 2020
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He
Xiaodong Liu
Jianfeng Gao
Weizhu Chen
AAML
165
2,750
0
05 Jun 2020
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
Su Lin Blodgett
Solon Barocas
Hal Daumé
Hanna M. Wallach
157
1,248
0
28 May 2020
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
210
1,107
0
08 May 2020
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Tal Linzen
276
194
0
03 May 2020
Multi-Dimensional Gender Bias Classification
Emily Dinan
Angela Fan
Ledell Yu Wu
Jason Weston
Douwe Kiela
Adina Williams
FaML
66
123
0
01 May 2020
BLEU might be Guilty but References are not Innocent
Markus Freitag
David Grangier
Isaac Caswell
57
149
0
13 Apr 2020
BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam
Dipanjan Das
Ankur P. Parikh
103
1,503
0
09 Apr 2020
Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)
Joelle Pineau
Philippe Vincent-Lamarre
Koustuv Sinha
V. Larivière
A. Beygelzimer
Florence dÁlché-Buc
E. Fox
Hugo Larochelle
83
361
0
27 Mar 2020
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
Jesse Dodge
Gabriel Ilharco
Roy Schwartz
Ali Farhadi
Hannaneh Hajishirzi
Noah A. Smith
101
597
0
15 Feb 2020
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
Max Bartolo
A. Roberts
Johannes Welbl
Sebastian Riedel
Pontus Stenetorp
AAML
112
175
0
02 Feb 2020
Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning
Peter Henderson
Jie Hu
Joshua Romoff
Emma Brunskill
Dan Jurafsky
Joelle Pineau
89
456
0
31 Jan 2020
Value-laden Disciplinary Shifts in Machine Learning
Ravit Dotan
S. Milli
AILaw
66
48
0
03 Dec 2019
Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation
Emily Dinan
Angela Fan
Adina Williams
Jack Urbanek
Douwe Kiela
Jason Weston
92
208
0
10 Nov 2019
Adversarial NLI: A New Benchmark for Natural Language Understanding
Yixin Nie
Adina Williams
Emily Dinan
Joey Tianyi Zhou
Jason Weston
Douwe Kiela
127
1,010
0
31 Oct 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
477
20,317
0
23 Oct 2019
MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension
Adam Fisch
Alon Talmor
Robin Jia
Minjoon Seo
Eunsol Choi
Danqi Chen
74
307
0
22 Oct 2019
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan
Mingda Chen
Sebastian Goodman
Kevin Gimpel
Piyush Sharma
Radu Soricut
SSL
AIMat
373
6,467
0
26 Sep 2019
Certified Robustness to Adversarial Word Substitutions
Robin Jia
Aditi Raghunathan
Kerem Göksel
Percy Liang
AAML
337
294
0
03 Sep 2019
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
Mor Geva
Yoav Goldberg
Jonathan Berant
323
326
0
21 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
677
24,541
0
26 Jul 2019
Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell
Ananya Ganesh
Andrew McCallum
73
2,660
0
05 Jun 2019
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Alex Jinpeng Wang
Yada Pruksachatkun
Nikita Nangia
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
274
2,323
0
02 May 2019
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang
Varsha Kishore
Felix Wu
Kilian Q. Weinberger
Yoav Artzi
352
5,860
0
21 Apr 2019
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
R. Thomas McCoy
Ellie Pavlick
Tal Linzen
139
1,244
0
04 Feb 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.8K
95,175
0
11 Oct 2018
Model Cards for Model Reporting
Margaret Mitchell
Simone Wu
Andrew Zaldivar
Parker Barnes
Lucy Vasserman
Ben Hutchinson
Elena Spitzer
Inioluwa Deborah Raji
Timnit Gebru
134
1,903
0
05 Oct 2018
Learning Gender-Neutral Word Embeddings
Jieyu Zhao
Yichao Zhou
Zeyu Li
Wei Wang
Kai-Wei Chang
FaML
103
415
0
29 Aug 2018
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
Divyansh Kaushik
Zachary Chase Lipton
ELM
78
233
0
14 Aug 2018
Stress Test Evaluation for Natural Language Inference
Aakanksha Naik
Abhilasha Ravichander
Norman M. Sadeh
Carolyn Rose
Graham Neubig
ELM
86
379
0
02 Jun 2018
Gender Bias in Coreference Resolution
Rachel Rudinger
Jason Naradowsky
Brian Leonard
Benjamin Van Durme
72
644
0
25 Apr 2018
A Call for Clarity in Reporting BLEU Scores
Matt Post
179
2,996
0
23 Apr 2018
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
1.1K
7,196
0
20 Apr 2018
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
Jieyu Zhao
Tianlu Wang
Mark Yatskar
Vicente Ordonez
Kai-Wei Chang
130
942
0
18 Apr 2018
AllenNLP: A Deep Semantic Natural Language Processing Platform
Matt Gardner
Joel Grus
Mark Neumann
Oyvind Tafjord
Pradeep Dasigi
Nelson F. Liu
Matthew E. Peters
Michael Schmitz
Luke Zettlemoyer
VLM
90
1,282
0
20 Mar 2018
Annotation Artifacts in Natural Language Inference Data
Suchin Gururangan
Swabha Swayamdipta
Omer Levy
Roy Schwartz
Samuel R. Bowman
Noah A. Smith
155
1,180
0
06 Mar 2018
1
2
Next