2005.04118
Cited By
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Papers citing
"Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Disentangled Contrastive Learning for Learning Robust Textual Representations
Xiang Chen
Xin Xie
Zhen Bi
Hongbin Ye
Shumin Deng
Ningyu Zhang
Huajun Chen
33
5
0
11 Apr 2021
Connecting Attributions and QA Model Behavior on Realistic Counterfactuals
Xi Ye
Rohan Nair
Greg Durrett
18
24
0
09 Apr 2021
KI-BERT: Infusing Knowledge Context for Better Language and Domain Understanding
Keyur Faldu
A. Sheth
Prashant Kikani
Hemang Akabari
16
28
0
09 Apr 2021
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela
Max Bartolo
Yixin Nie
Divyansh Kaushik
Atticus Geiger
...
Pontus Stenetorp
Robin Jia
Joey Tianyi Zhou
Christopher Potts
Adina Williams
24
387
0
07 Apr 2021
What Will it Take to Fix Benchmarking in Natural Language Understanding?
Samuel R. Bowman
George E. Dahl
ELM
ALM
30
156
0
05 Apr 2021
TMR: Evaluating NER Recall on Tough Mentions
Jingxuan Tu
Constantine Lignos
29
4
0
23 Mar 2021
Local Interpretations for Explainable Natural Language Processing: A Survey
Siwen Luo
Hamish Ivison
S. Han
Josiah Poon
MILM
33
48
0
20 Mar 2021
Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence
Tal Schuster
Adam Fisch
Regina Barzilay
28
225
0
15 Mar 2021
Visual Cues and Error Correction for Translation Robustness
Zhenhao Li
Marek Rei
Lucia Specia
20
3
0
12 Mar 2021
Are NLP Models really able to Solve Simple Math Word Problems?
Arkil Patel
S. Bhattamishra
Navin Goyal
ReLM
LRM
27
766
0
12 Mar 2021
Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks
A. Wang
Dakuo Wang
Jaimie Drozdal
Michael J. Muller
Soya Park
Justin D. Weisz
Xuye Liu
Lingfei Wu
Casey Dugan
44
63
0
24 Feb 2021
Testing Framework for Black-box AI Models
Aniya Aggarwal
Samiullah Shaikh
Sandeep Hans
Swastik Haldar
Rema Ananthanarayanan
Diptikalyan Saha
16
8
0
11 Feb 2021
Defuse: Harnessing Unrestricted Adversarial Examples for Debugging Models Beyond Test Accuracy
Dylan Slack
N. Rauschmayr
K. Kenthapadi
AAML
17
2
0
11 Feb 2021
Statistically Profiling Biases in Natural Language Reasoning Datasets and Models
Shanshan Huang
Kenny Q. Zhu
16
1
0
09 Feb 2021
The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point
Magnus Sahlgren
F. Carlsson
30
26
0
08 Feb 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann
Tosin P. Adewumi
Karmanya Aggarwal
Pawan Sasanka Ammanamanchi
Aremu Anuoluwapo
...
Nishant Subramani
Wei-ping Xu
Diyi Yang
Akhila Yerukola
Jiawei Zhou
VLM
257
285
0
02 Feb 2021
Measuring and Improving Consistency in Pretrained Language Models
Yanai Elazar
Nora Kassner
Shauli Ravfogel
Abhilasha Ravichander
Eduard H. Hovy
Hinrich Schütze
Yoav Goldberg
HILM
269
346
0
01 Feb 2021
ShufText: A Simple Black Box Approach to Evaluate the Fragility of Text Classification Models
Rutuja Taware
Shraddha Varat
G. Salunke
Chaitanya Gawande
Geetanjali Kale
Rahul Khengare
Raviraj Joshi
17
5
0
30 Jan 2021
CD2CR: Co-reference Resolution Across Documents and Domains
James Ravenscroft
Arie Cattan
A. Clare
Ido Dagan
Maria Liakata
75
8
0
29 Jan 2021
Reproducibility, Replicability and Beyond: Assessing Production Readiness of Aspect Based Sentiment Analysis in the Wild
Rajdeep Mukherjee
Shreyas Shetty
S. Chattopadhyay
Subhadeep Maji
S. Datta
Pawan Goyal
35
14
0
23 Jan 2021
Robustness Gym: Unifying the NLP Evaluation Landscape
Karan Goel
Nazneen Rajani
Jesse Vig
Samson Tan
Jason M. Wu
Stephan Zheng
Caiming Xiong
Joey Tianyi Zhou
Christopher Ré
AAML
OffRL
OOD
154
137
0
13 Jan 2021
Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models
Tongshuang Wu
Marco Tulio Ribeiro
Jeffrey Heer
Daniel S. Weld
41
240
0
01 Jan 2021
FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging
Han Guo
Nazneen Rajani
Peter Hase
Joey Tianyi Zhou
Caiming Xiong
TDI
41
102
0
31 Dec 2020
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
Bertie Vidgen
Tristan Thrush
Zeerak Talat
Douwe Kiela
26
242
0
31 Dec 2020
HateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger
B. Vidgen
Dong Nguyen
Zeerak Talat
Helen Z. Margetts
J. Pierrehumbert
31
259
0
31 Dec 2020
Robustness Testing of Language Understanding in Task-Oriented Dialog
Jiexi Liu
Ryuichi Takanobu
Jiaxin Wen
Dazhen Wan
Hongguang Li
Weiran Nie
Cheng Li
Wei Peng
Minlie Huang
ELM
30
48
0
30 Dec 2020
RoCUS: Robot Controller Understanding via Sampling
Yilun Zhou
Serena Booth
Nadia Figueroa
J. Shah
24
13
0
25 Dec 2020
I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling
Yixin Nie
Mary Williamson
Joey Tianyi Zhou
Douwe Kiela
Jason Weston
11
82
0
24 Dec 2020
MT-Teql: Evaluating and Augmenting Consistency of Text-to-SQL Models with Metamorphic Testing
Pingchuan Ma
Shuai Wang
16
2
0
21 Dec 2020
Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals
Zhao Wang
A. Culotta
CML
OOD
14
98
0
18 Dec 2020
WILDS: A Benchmark of in-the-Wild Distribution Shifts
Pang Wei Koh
Shiori Sagawa
Henrik Marklund
Sang Michael Xie
Marvin Zhang
...
A. Kundaje
Emma Pierson
Sergey Levine
Chelsea Finn
Percy Liang
OOD
53
1,377
0
14 Dec 2020
Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text
Nishtha Madaan
Inkit Padhi
Naveen Panwar
Diptikalyan Saha
CML
41
98
0
08 Dec 2020
Detection and Classification of mental illnesses on social media using RoBERTa
Ankit Murarka
Balaji Radhakrishnan
S. Ravichandran
AI4MH
14
45
0
23 Nov 2020
Challenges in Deploying Machine Learning: a Survey of Case Studies
Andrei Paleyes
Raoul-Gabriel Urma
Neil D. Lawrence
23
389
0
18 Nov 2020
SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher
Thai Le
Noseong Park
Dongwon Lee
AAML
6
20
0
17 Nov 2020
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Alexander D'Amour
Katherine A. Heller
D. Moldovan
Ben Adlam
B. Alipanahi
...
Kellie Webster
Steve Yadlowsky
T. Yun
Xiaohua Zhai
D. Sculley
OffRL
77
670
0
06 Nov 2020
Influence Patterns for Explaining Information Flow in BERT
Kaiji Lu
Zifan Wang
Piotr (Peter) Mardziel
Anupam Datta
GNN
27
16
0
02 Nov 2020
ABNIRML: Analyzing the Behavior of Neural IR Models
Sean MacAvaney
Sergey Feldman
Nazli Goharian
Doug Downey
Arman Cohan
15
49
0
02 Nov 2020
CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers
Shiyang Li
Semih Yavuz
Kazuma Hashimoto
Jia Li
Tong Niu
Nazneen Rajani
Xifeng Yan
Yingbo Zhou
Caiming Xiong
44
62
0
24 Oct 2020
Measuring Association Between Labels and Free-Text Rationales
Sarah Wiegreffe
Ana Marasović
Noah A. Smith
282
170
0
24 Oct 2020
Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?
Peter Shaw
Ming-Wei Chang
Panupong Pasupat
Kristina Toutanova
CoGe
27
182
0
24 Oct 2020
Learning to Recognize Dialect Features
Dorottya Demszky
D. Sharma
J. Clark
Vinodkumar Prabhakaran
Jacob Eisenstein
114
38
0
23 Oct 2020
Semantics of the Black-Box: Can knowledge graphs help make deep learning systems more interpretable and explainable?
Manas Gaur
Keyur Faldu
A. Sheth
37
113
0
16 Oct 2020
Multi-task Learning of Negation and Speculation for Targeted Sentiment Classification
Andrew Moore
Jeremy Barnes
33
9
0
16 Oct 2020
Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI
Alon Jacovi
Ana Marasović
Tim Miller
Yoav Goldberg
255
426
0
15 Oct 2020
Fine-grained linguistic evaluation for state-of-the-art Machine Translation
Eleftherios Avramidis
Vivien Macketanz
Ursula Strohriegel
A. Burchardt
Sebastian Möller
ELM
8
16
0
13 Oct 2020
Thinking Fast and Slow in AI
G. Booch
F. Fabiano
L. Horesh
Kiran Kate
J. Lenchner
...
Andrea Loreggia
K. Murugesan
Nicholas Mattei
F. Rossi
Biplav Srivastava
21
94
0
12 Oct 2020
Can RNNs trained on harder subject-verb agreement instances still perform well on easier ones?
Hritik Bansal
Gantavya Bhatt
Sumeet Agarwal
19
0
0
10 Oct 2020
A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions
Takuma Udagawa
T. Yamazaki
Akiko Aizawa
27
11
0
07 Oct 2020
Astraea: Grammar-based Fairness Testing
E. Soremekun
Sakshi Udeshi
Sudipta Chattopadhyay
26
27
0
06 Oct 2020