2005.04118
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Papers citing
"Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Title
Is My Model Using The Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning
Vivek Gupta
Riyaz Ahmad Bhat
Atreya Ghosal
Manisha Srivastava
M. Singh
Vivek Srikumar
LMTD
15
18
0
02 Aug 2021
Did the Model Change? Efficiently Assessing Machine Learning API Shifts
Lingjiao Chen
Tracy Cai
Matei A. Zaharia
James Zou
20
17
0
29 Jul 2021
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
Mor Geva
Tomer Wolfson
Jonathan Berant
ReLM
LRM
20
21
0
29 Jul 2021
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
Anna Rogers
Matt Gardner
Isabelle Augenstein
27
163
0
27 Jul 2021
Back-Translated Task Adaptive Pretraining: Improving Accuracy and Robustness on Text Classification
Junghoon Lee
Jounghee Kim
Pilsung Kang
VLM
13
5
0
22 Jul 2021
Spinning Sequence-to-Sequence Models with Meta-Backdoors
Eugene Bagdasaryan
Vitaly Shmatikov
SILM
AAML
38
8
0
22 Jul 2021
As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation
Jun Wang
Chang Xu
Francisco Guzman
Ahmed El-Kishky
Benjamin I. P. Rubinstein
Trevor Cohn
27
10
0
18 Jul 2021
M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis
Xingbo Wang
Jianben He
Zhihua Jin
Muqiao Yang
Yong Wang
Huamin Qu
13
75
0
17 Jul 2021
How Vulnerable Are Automatic Fake News Detection Methods to Adversarial Attacks?
Camille Koenders
Johannes Filla
Nicolai Schneider
Vinicius Woloszyn
GNN
22
15
0
16 Jul 2021
Intersectional Bias in Causal Language Models
Liam Magee
Lida Ghahremanlou
K. Soldatić
S. Robertson
191
31
0
16 Jul 2021
You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a (Mostly) Serverless and Open Stack
Jacopo Tagliabue
24
15
0
15 Jul 2021
Trusting RoBERTa over BERT: Insights from CheckListing the Natural Language Inference Task
Ishan Tarunesh
Somak Aditya
Monojit Choudhury
15
17
0
15 Jul 2021
Tailor: Generating and Perturbing Text with Semantic Controls
Alexis Ross
Tongshuang Wu
Hao Peng
Matthew E. Peters
Matt Gardner
136
77
0
15 Jul 2021
DaCy: A Unified Framework for Danish NLP
Kenneth C. Enevoldsen
Lasse Hansen
Kristoffer Laigaard Nielbo
29
13
0
12 Jul 2021
Machine Learning for Fraud Detection in E-Commerce: A Research Agenda
Niek Tax
Kees Jan de Vries
Mathijs de Jong
Nikoleta Dosoula
Bram van den Akker
Jon Smith
Olivier Thuong
Lucas Bernardi
6
19
0
05 Jul 2021
Mandoline: Model Evaluation under Distribution Shift
Mayee F. Chen
Karan Goel
N. Sohoni
Fait Poms
Kayvon Fatahalian
Christopher Ré
28
69
0
01 Jul 2021
Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis
Linyi Yang
Jiazheng Li
Padraig Cunningham
Yue Zhang
Barry Smyth
Ruihai Dong
11
47
0
29 Jun 2021
Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics
Paula Czarnowska
Yogarshi Vyas
Kashif Shah
21
104
0
28 Jun 2021
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Simon Mille
Kaustubh D. Dhole
Saad Mahamood
Laura Perez-Beltrachini
Varun Gangal
Mihir Kale
Emiel van Miltenburg
Sebastian Gehrmann
ELM
42
22
0
16 Jun 2021
Efficient (Soft) Q-Learning for Text Generation with Limited Good Data
Han Guo
Bowen Tan
Zhengzhong Liu
Eric P. Xing
Zhiting Hu
OffRL
33
33
0
14 Jun 2021
Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP
Anthony Chen
Pallavi Gudipati
Shayne Longpre
Xiao Ling
Sameer Singh
17
38
0
12 Jun 2021
How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation
Swaroop Mishra
Anjana Arunkumar
34
24
0
10 Jun 2021
Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions
Daniel Rosenberg
Itai Gat
Amir Feder
Roi Reichart
AAML
39
16
0
08 Jun 2021
PROST: Physical Reasoning of Objects through Space and Time
Stéphane Aroca-Ouellette
Cory Paik
A. Roncone
Katharina Kann
LRM
19
46
0
07 Jun 2021
Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia
Jiao Sun
Nanyun Peng
14
46
0
03 Jun 2021
Posthoc Verification and the Fallibility of the Ground Truth
Yifan Ding
Nicholas Botzer
Tim Weninger
11
5
0
02 Jun 2021
Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests
Victor Veitch
Alexander D'Amour
Steve Yadlowsky
Jacob Eisenstein
OOD
24
91
0
31 May 2021
Changing the World by Changing the Data
Anna Rogers
16
71
0
28 May 2021
Contrastive Fine-tuning Improves Robustness for Neural Rankers
Xiaofei Ma
Cicero Nogueira dos Santos
Andrew O. Arnold
18
20
0
27 May 2021
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
Zhiyi Ma
Kawin Ethayarajh
Tristan Thrush
Somya Jain
Ledell Yu Wu
Robin Jia
Christopher Potts
Adina Williams
Douwe Kiela
ELM
33
57
0
21 May 2021
Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence
Jian Guan
Xiaoxi Mao
Changjie Fan
Zitao Liu
Wenbiao Ding
Minlie Huang
AuLLM
29
78
0
19 May 2021
OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics
Jian Guan
Zhexin Zhang
Zhuoer Feng
Zitao Liu
Wenbiao Ding
Xiaoxi Mao
Changjie Fan
Minlie Huang
12
60
0
19 May 2021
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
Ruiqi Zhong
Dhruba Ghosh
Dan Klein
Jacob Steinhardt
33
35
0
13 May 2021
Designing Multimodal Datasets for NLP Challenges
James Pustejovsky
E. Holderness
Jingxuan Tu
Parker Glenn
Kyeongmin Rim
Kelley Lynch
R. Brutti
23
5
0
12 May 2021
How Reliable are Model Diagnostics?
V. Aribandi
Yi Tay
Donald Metzler
19
19
0
12 May 2021
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization
Damien Teney
Ehsan Abbasnejad
Simon Lucey
Anton Van Den Hengel
28
87
0
12 May 2021
D2S: Document-to-Slide Generation Via Query-Based Text Summarization
Edward Sun
Yufang Hou
Dakuo Wang
Yunfeng Zhang
N. Wang
27
34
0
08 May 2021
Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates
Yuqing Xie
Yi-An Lai
Yuanjun Xiong
Yi Zhang
Stefano Soatto
UQCV
19
16
0
07 May 2021
Reliability Testing for Natural Language Processing Systems
Samson Tan
Chenyu You
K. Baxter
Araz Taeihagh
G. Bennett
Min-Yen Kan
15
38
0
06 May 2021
Do Natural Language Explanations Represent Valid Logical Arguments? Verifying Entailment in Explainable NLI Gold Standards
Marco Valentino
Ian Pratt-Hartmann
André Freitas
XAI
LRM
21
12
0
05 May 2021
Russian News Clustering and Headline Selection Shared Task
I. Gusev
I. Smurov
21
7
0
03 May 2021
Explanation-Based Human Debugging of NLP Models: A Survey
Piyawat Lertvittayakumjorn
Francesca Toni
LRM
42
79
0
30 Apr 2021
CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
Qinyuan Ye
Bill Yuchen Lin
Xiang Ren
220
180
0
18 Apr 2021
Learning with Instance Bundles for Reading Comprehension
Dheeru Dua
Pradeep Dasigi
Sameer Singh
Matt Gardner
37
11
0
18 Apr 2021
Revealing Persona Biases in Dialogue Systems
Emily Sheng
Josh Arnold
Zhou Yu
Kai-Wei Chang
Nanyun Peng
25
37
0
18 Apr 2021
Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation
Max Bartolo
Tristan Thrush
Robin Jia
Sebastian Riedel
Pontus Stenetorp
Douwe Kiela
AAML
28
103
0
18 Apr 2021
Sometimes We Want Translationese
Prasanna Parthasarathi
Koustuv Sinha
J. Pineau
Adina Williams
AAML
22
4
0
15 Apr 2021
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
Sebastian Ruder
Noah Constant
Jan A. Botha
Aditya Siddhant
Orhan Firat
...
Pengfei Liu
Junjie Hu
Dan Garrette
Graham Neubig
Melvin Johnson
ELM
AAML
LRM
24
184
0
15 Apr 2021
On the Robustness of Intent Classification and Slot Labeling in Goal-oriented Dialog Systems to Real-world Noise
Sailik Sengupta
Jason Krone
Saab Mansour
NoLa
11
12
0
14 Apr 2021
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation
Chong Zhang
Jieyu Zhao
Huan Zhang
Kai-Wei Chang
Cho-Jui Hsieh
AAML
24
10
0
12 Apr 2021