arXiv: 2005.04118
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Papers citing
"Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Title
RoMe: A Robust Metric for Evaluating Natural Language Generation
Md. Rony
Liubov Kovriguina
Debanjan Chaudhuri
Ricardo Usbeck
Jens Lehmann
22
12
0
17 Mar 2022
An Analysis of Negation in Natural Language Understanding Corpora
Md Mosharaf Hossain
Dhivya Chinnappa
Eduardo Blanco
16
42
0
16 Mar 2022
Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness
Tejas Gokhale
Swaroop Mishra
Man Luo
Bhavdeep Singh Sachdeva
Chitta Baral
52
29
0
15 Mar 2022
CARETS: A Consistency And Robustness Evaluative Test Suite for VQA
Carlos E. Jimenez
Olga Russakovsky
Karthik Narasimhan
CoGe
29
14
0
15 Mar 2022
Dawn of the transformer era in speech emotion recognition: closing the valence gap
Johannes Wagner
Andreas Triantafyllopoulos
H. Wierstorf
Maximilian Schmitt
Felix Burkhardt
F. Eyben
Björn W. Schuller
15
284
0
14 Mar 2022
What Makes Reading Comprehension Questions Difficult?
Saku Sugawara
Nikita Nangia
Alex Warstadt
Sam Bowman
ELM
RALM
20
13
0
12 Mar 2022
Mapping global dynamics of benchmark creation and saturation in artificial intelligence
Simon Ott
A. Barbosa-Silva
Kathrin Blagec
J. Brauner
Matthias Samwald
32
36
0
09 Mar 2022
iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models
Jun Yuan
Jesse Vig
Nazneen Rajani
19
13
0
08 Mar 2022
On the data requirements of probing
Zining Zhu
Jixuan Wang
Bai Li
Frank Rudzicz
27
5
0
25 Feb 2022
XAutoML: A Visual Analytics Tool for Understanding and Validating Automated Machine Learning
Marc-André Zöller
Waldemar Titov
T. Schlegel
Marco F. Huber
HAI
11
9
0
24 Feb 2022
Hierarchical Interpretation of Neural Text Classification
Hanqi Yan
Lin Gui
Yulan He
45
14
0
20 Feb 2022
Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
Qixiang Fang
D. Nguyen
Daniel L. Oberski
27
12
0
18 Feb 2022
XAI for Transformers: Better Explanations through Conservative Propagation
Ameen Ali
Thomas Schnake
Oliver Eberle
G. Montavon
Klaus-Robert Müller
Lior Wolf
FAtt
15
89
0
15 Feb 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann
Elizabeth Clark
Thibault Sellam
ELM
AI4CE
69
184
0
14 Feb 2022
Counterfactual Multi-Token Fairness in Text Classification
P. Lohia
21
3
0
08 Feb 2022
Red Teaming Language Models with Language Models
Ethan Perez
Saffron Huang
Francis Song
Trevor Cai
Roman Ring
John Aslanides
Amelia Glaese
Nat McAleese
G. Irving
AAML
13
610
0
07 Feb 2022
Measuring and Reducing Model Update Regression in Structured Prediction for NLP
Deng Cai
Elman Mansimov
Yi-An Lai
Yixuan Su
Lei Shu
Yi Zhang
KELM
67
8
0
07 Feb 2022
Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities
Xin Du
Bénédicte Legastelois
B. Ganesh
A. Rajan
Hana Chockler
Vaishak Belle
Stuart Anderson
S. Ramamoorthy
AAML
27
6
0
27 Jan 2022
Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns
Reto Gubelmann
Siegfried Handschuh
ReLM
LRM
38
6
0
19 Jan 2022
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
Alisa Liu
Swabha Swayamdipta
Noah A. Smith
Yejin Choi
82
212
0
16 Jan 2022
Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions
Marwan Omar
Soohyeon Choi
Daehun Nyang
David A. Mohaisen
32
57
0
03 Jan 2022
On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations
Aamir Miyajiwala
Arnav Ladkat
Samiksha Jagadale
Raviraj Joshi
AAML
17
7
0
02 Jan 2022
Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card Messages
Jiao Sun
Tongshuang Wu
Yue Jiang
Ronil Awalegaonkar
Xi Lin
Diyi Yang
13
8
0
28 Dec 2021
An Interdisciplinary Approach for the Automated Detection and Visualization of Media Bias in News Articles
Timo Spinde
30
13
0
26 Dec 2021
More Than Words: Towards Better Quality Interpretations of Text Classifiers
Muhammad Bilal Zafar
Philipp Schmidt
Michele Donini
Cédric Archambeau
F. Biessmann
Sanjiv Ranjan Das
K. Kenthapadi
FAtt
19
5
0
23 Dec 2021
Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction
Dongfang Li
Baotian Hu
Qingcai Chen
Tujie Xu
Jingcong Tao
Yunan Zhang
32
12
0
20 Dec 2021
Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
Kyle Richardson
Ashish Sabharwal
ReLM
LRM
30
24
0
16 Dec 2021
DuQM: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models
Hongyu Zhu
Yan Chen
Jing Yang
Jing Liu
Yu Hong
Ying-Cong Chen
Hua Wu
Haifeng Wang
AAML
25
6
0
16 Dec 2021
Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics
Hyundong Justin Cho
Chinnadhurai Sankar
Christopher Lin
Kaushik Ram Sadagopan
Shahin Shayandeh
Asli Celikyilmaz
Jonathan May
Ahmad Beirami
60
10
0
15 Dec 2021
Measure and Improve Robustness in NLP Models: A Survey
Xuezhi Wang
Haohan Wang
Diyi Yang
139
130
0
15 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
29
109
0
14 Dec 2021
The King is Naked: on the Notion of Robustness for Natural Language Processing
Emanuele La Malfa
Marta Z. Kwiatkowska
20
28
0
13 Dec 2021
Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation
Raymond Li
Wen Xiao
Linzi Xing
Lanjun Wang
Gabriel Murray
Giuseppe Carenini
ViT
27
7
0
10 Dec 2021
Thinking Beyond Distributions in Testing Machine Learned Models
Negar Rostamzadeh
B. Hutchinson
Christina Greer
Vinodkumar Prabhakaran
TTA
40
6
0
06 Dec 2021
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh D. Dhole
Varun Gangal
Sebastian Gehrmann
Aadesh Gupta
Zhenhao Li
...
Tianbao Xie
Usama Yaseen
Michael A. Yee
Jing Zhang
Yue Zhang
174
86
0
06 Dec 2021
Toward a Taxonomy of Trust for Probabilistic Machine Learning
Tamara Broderick
Andrew Gelman
Rachael Meager
Anna L. Smith
Tian Zheng
34
9
0
05 Dec 2021
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI
Ishan Tarunesh
Somak Aditya
Monojit Choudhury
ELM
LRM
31
4
0
04 Dec 2021
True or False: Does the Deep Learning Model Learn to Detect Rumors?
Shiwen Ni
Jiawen Li
Hung-Yu Kao
16
3
0
01 Dec 2021
What Do You See in this Patient? Behavioral Testing of Clinical NLP Models
Betty van Aken
S. Herrmann
Alexander Löser
26
11
0
30 Nov 2021
AI and the Everything in the Whole Wide World Benchmark
Inioluwa Deborah Raji
Emily M. Bender
Amandalynne Paullada
Emily L. Denton
A. Hanna
30
291
0
26 Nov 2021
True Few-Shot Learning with Prompts -- A Real-World Perspective
Timo Schick
Hinrich Schütze
VLM
27
64
0
26 Nov 2021
Network representation learning: A macro and micro view
Xueyi Liu
Jie Tang
GNN
AI4TS
19
23
0
21 Nov 2021
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning
Keng Ji Chow
Samson Tan
Min-Yen Kan
LRM
26
4
0
21 Nov 2021
Beyond NDCG: behavioral testing of recommender systems with RecList
P. Chia
Jacopo Tagliabue
Federico Bianchi
Chloe He
Brian Ko
27
27
0
18 Nov 2021
How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task
Urja Khurana
Eric T. Nalisnick
Antske Fokkens
MoMe
35
6
0
18 Nov 2021
Interpreting Language Models Through Knowledge Graph Extraction
Vinitra Swamy
Angelika Romanou
Martin Jaggi
30
20
0
16 Nov 2021
STAMP 4 NLP -- An Agile Framework for Rapid Quality-Driven NLP Applications Development
Philipp Kohl
Oliver Schmidts
Lars Klöser
H. Werth
Bodo Kraft
Albert Zündorf
VLM
19
1
0
16 Nov 2021
Identification of Fine-Grained Location Mentions in Crisis Tweets
Sarthak Khanal
Maria Traskowsky
Doina Caragea
16
4
0
11 Nov 2021
NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation
David Alfonso-Hermelo
Ahmad Rashid
Abbas Ghaddar
Huawei Noah’s
Mehdi Rezagholizadeh
37
2
0
09 Nov 2021
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Wei Ping
Chejian Xu
Shuohang Wang
Zhe Gan
Yu Cheng
Jianfeng Gao
Ahmed Hassan Awadallah
Bohao Li
VLM
ELM
AAML
33
215
0
04 Nov 2021