2005.04118
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Papers citing
"Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Data Synthesis for Testing Black-Box Machine Learning Models
Diptikalyan Saha
Aniya Aggarwal
Sandeep Hans
22
4
0
03 Nov 2021
Template Filling for Controllable Commonsense Reasoning
Dheeraj Rajagopal
Vivek Khetan
Bogdan Sacaleanu
A. Gershman
Andy E. Fano
Eduard H. Hovy
BDL
LRM
25
6
0
31 Oct 2021
DAG Card is the new Model Card
Jacopo Tagliabue
Ville Tuulos
C. Greco
Valay Dave
SyDa
39
11
0
24 Oct 2021
Behavioral Experiments for Understanding Catastrophic Forgetting
Samuel J. Bell
Neil D. Lawrence
35
4
0
20 Oct 2021
AequeVox: Automated Fairness Testing of Speech Recognition Systems
Sai Sathiesh Rajan
Sakshi Udeshi
Sudipta Chattopadhyay
28
15
0
19 Oct 2021
Label-Descriptive Patterns and Their Application to Characterizing Classification Errors
Michael A. Hedderich
Jonas Fischer
Dietrich Klakow
Jilles Vreeken
6
10
0
18 Oct 2021
Predicting the Performance of Multilingual NLP Models
A. Srinivasan
Sunayana Sitaram
T. Ganu
Sandipan Dandapat
Kalika Bali
Monojit Choudhury
LRM
32
27
0
17 Oct 2021
On the Robustness of Reading Comprehension Models to Entity Renaming
Jun Yan
Yang Xiao
Sagnik Mukherjee
Bill Yuchen Lin
Robin Jia
Xiang Ren
16
20
0
16 Oct 2021
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
Sam Bowman
OffRL
24
45
0
15 Oct 2021
Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models
Tianlu Wang
Rohit Sridhar
Diyi Yang
Xuezhi Wang
AAML
120
72
0
14 Oct 2021
Retrieval-guided Counterfactual Generation for QA
Bhargavi Paranjape
Matthew Lamm
Ian Tenney
33
31
0
14 Oct 2021
The Irrationality of Neural Rationale Models
Yiming Zheng
Serena Booth
J. Shah
Yilun Zhou
35
16
0
14 Oct 2021
PARE: A Simple and Strong Baseline for Monolingual and Multilingual Distantly Supervised Relation Extraction
Vipul Rathore
Kartikeya Badola
Mausam
Parag Singla
41
19
0
14 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
54
17
0
14 Oct 2021
Interpreting the Robustness of Neural NLP Models to Textual Perturbations
Yunxiang Zhang
Liangming Pan
Samson Tan
Min-Yen Kan
33
21
0
14 Oct 2021
AutoNLU: Detecting, root-causing, and fixing NLU model errors
P. Sethi
Denis Savenkov
Forough Arabshahi
Jack Goetz
Micaela Tolliver
Nicolas Scheffer
I. Kabul
Yue Liu
Ahmed Aly
18
4
0
12 Oct 2021
Salient ImageNet: How to discover spurious features in Deep Learning?
Sahil Singla
S. Feizi
AAML
VLM
29
115
0
08 Oct 2021
Automated Testing of AI Models
Swagatam Haldar
Deepak Vijaykeerthy
Diptikalyan Saha
VLM
21
0
0
07 Oct 2021
GNN is a Counter? Revisiting GNN for Question Answering
Kuan Wang
Yuyu Zhang
Diyi Yang
Le Song
Tao Qin
LMTD
29
30
0
07 Oct 2021
Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development
Aspen K. Hopkins
Serena Booth
29
45
0
06 Oct 2021
Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance
Karthikeyan K
Aalok Sathe
Somak Aditya
Monojit Choudhury
LRM
33
10
0
05 Oct 2021
Trustworthy AI: From Principles to Practices
Bo-wen Li
Peng Qi
Bo Liu
Shuai Di
Jingen Liu
Jiquan Pei
Jinfeng Yi
Bowen Zhou
119
356
0
04 Oct 2021
Human-Centered AI for Data Science: A Systematic Approach
Dakuo Wang
Xiaojuan Ma
A. Wang
17
3
0
03 Oct 2021
Enhancing Model Robustness and Fairness with Causality: A Regularization Approach
Zhao Wang
Kai Shu
A. Culotta
OOD
21
14
0
03 Oct 2021
Language Invariant Properties in Natural Language Processing
Federico Bianchi
Debora Nozza
Dirk Hovy
55
3
0
27 Sep 2021
RuleBert: Teaching Soft Rules to Pre-trained Language Models
Mohammed Saeed
N. Ahmadi
Preslav Nakov
Paolo Papotti
LRM
253
31
0
24 Sep 2021
Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction
Bruno Taillé
Vincent Guigue
Geoffrey Scoutheeten
Patrick Gallinari
79
5
0
24 Sep 2021
Robust Generalization of Quadratic Neural Networks via Function Identification
Kan Xu
Hamsa Bastani
Osbert Bastani
OOD
34
8
0
22 Sep 2021
Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation
Diptesh Kanojia
M. Fomicheva
Tharindu Ranasinghe
Frédéric Blain
Constantin Orăsan
Lucia Specia
18
10
0
22 Sep 2021
NADE: A Benchmark for Robust Adverse Drug Events Extraction in Face of Negations
Simone Scaboro
Beatrice Portelli
Emmanuele Chersoni
Enrico Santus
G. Serra
25
9
0
21 Sep 2021
Types of Out-of-Distribution Texts and How to Detect Them
Udit Arora
William Huang
He He
OODD
225
97
0
14 Sep 2021
Tribrid: Stance Classification with Neural Inconsistency Detection
Song Yang
Jacopo Urbani
14
6
0
14 Sep 2021
SituatedQA: Incorporating Extra-Linguistic Contexts into QA
Michael J.Q. Zhang
Eunsol Choi
RALM
32
136
0
13 Sep 2021
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Ananya B. Sai
Tanay Dixit
D. Y. Sheth
S. Mohan
Mitesh M. Khapra
AAML
116
57
0
13 Sep 2021
Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers
Shane Storks
J. Chai
51
5
0
10 Sep 2021
An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model
Kijong Han
Seojin Lee
Wooin Lee
Joosung Lee
Donghun Lee
AAML
25
5
0
10 Sep 2021
AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction
Dong-Ho Lee
Ravi Kiran Selvam
Sheikh Muhammad Sarwar
Bill Yuchen Lin
Fred Morstatter
Jay Pujara
Elizabeth Boschee
James Allan
Xiang Ren
31
2
0
10 Sep 2021
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
Amir Feder
Katherine A. Keith
Emaad A. Manzoor
Reid Pryzant
Dhanya Sridhar
...
Roi Reichart
Margaret E. Roberts
Brandon M Stewart
Victor Veitch
Diyi Yang
CML
41
234
0
02 Sep 2021
DuTrust: A Sentiment Analysis Dataset for Trustworthiness Evaluation
Lijie Wang
Hao Liu
Shu-ping Peng
Hongxuan Tang
Xinyan Xiao
Ying-Cong Chen
Hua Wu
Haifeng Wang
25
5
0
30 Aug 2021
LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation
Jian Guan
Zhuoer Feng
Yamei Chen
Ru He
Xiaoxi Mao
Changjie Fan
Minlie Huang
39
32
0
30 Aug 2021
HeadlineCause: A Dataset of News Headlines for Detecting Causalities
I. Gusev
Alexey Tikhonov
CML
14
7
0
28 Aug 2021
Deep learning models are not robust against noise in clinical text
M. Moradi
Kathrin Blagec
Matthias Samwald
OOD
25
6
0
27 Aug 2021
Evaluating the Robustness of Neural Language Models to Input Perturbations
M. Moradi
Matthias Samwald
AAML
48
95
0
27 Aug 2021
DoWhy: Addressing Challenges in Expressing and Validating Causal Assumptions
Amit Sharma
Vasilis Syrgkanis
Cheng Zhang
Emre Kıcıman
24
26
0
27 Aug 2021
ComSum: Commit Messages Summarization and Meaning Preservation
Leshem Choshen
Idan Amit
17
4
0
23 Aug 2021
Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models
Myeongjun Jang
D. Kwon
Thomas Lukasiewicz
38
13
0
15 Aug 2021
Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems
Laurel J. Orr
Atindriyo Sanyal
Xiao Ling
Karan Goel
Megan Leszczynski
25
18
0
11 Aug 2021
Using Metamorphic Relations to Verify and Enhance Artcode Classification
Liming Xu
Dave Towey
Andrew P. French
Steve Benford
Z. Zhou
T. Chen
19
8
0
05 Aug 2021
Underreporting of errors in NLG output, and what to do about it
Emiel van Miltenburg
Miruna Clinciu
Ondrej Dusek
Dimitra Gkatzia
Stephanie Inglis
...
Saad Mahamood
Emma Manning
S. Schoch
Craig Thomson
Luou Wen
27
38
0
02 Aug 2021
TabPert: An Effective Platform for Tabular Perturbation
Nupur Jain
Vivek Gupta
Anshul Rai
G. Kumar
LMTD
14
5
0
02 Aug 2021