Beyond Accuracy: Behavioral Testing of NLP models with CheckList

8 May 2020

Tongshuang Wu

Papers citing "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"

50 / 664 papers shown

RoMe: A Robust Metric for Evaluating Natural Language Generation Md. Rony Liubov Kovriguina Debanjan Chaudhuri Ricardo Usbeck Jens Lehmann 22 12 0 17 Mar 2022
An Analysis of Negation in Natural Language Understanding Corpora Md Mosharaf Hossain Dhivya Chinnappa Eduardo Blanco 16 42 0 16 Mar 2022
Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness Tejas Gokhale Swaroop Mishra Man Luo Bhavdeep Singh Sachdeva Chitta Baral 52 29 0 15 Mar 2022
CARETS: A Consistency And Robustness Evaluative Test Suite for VQA Carlos E. Jimenez Olga Russakovsky Karthik Narasimhan CoGe 29 14 0 15 Mar 2022
Dawn of the transformer era in speech emotion recognition: closing the valence gap Johannes Wagner Andreas Triantafyllopoulos H. Wierstorf Maximilian Schmitt Felix Burkhardt F. Eyben Björn W. Schuller 15 284 0 14 Mar 2022
What Makes Reading Comprehension Questions Difficult? Saku Sugawara Nikita Nangia Alex Warstadt Sam Bowman ELM RALM 20 13 0 12 Mar 2022
Mapping global dynamics of benchmark creation and saturation in artificial intelligence Simon Ott A. Barbosa-Silva Kathrin Blagec J. Brauner Matthias Samwald 32 36 0 09 Mar 2022
iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models Jun Yuan Jesse Vig Nazneen Rajani 19 13 0 08 Mar 2022
On the data requirements of probing Zining Zhu Jixuan Wang Bai Li Frank Rudzicz 27 5 0 25 Feb 2022
XAutoML: A Visual Analytics Tool for Understanding and Validating Automated Machine Learning Marc-André Zöller Waldemar Titov T. Schlegel Marco F. Huber HAI 11 9 0 24 Feb 2022
Hierarchical Interpretation of Neural Text Classification Hanqi Yan Lin Gui Yulan He 45 14 0 20 Feb 2022
Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions Qixiang Fang D. Nguyen Daniel L. Oberski 27 12 0 18 Feb 2022
XAI for Transformers: Better Explanations through Conservative Propagation Ameen Ali Thomas Schnake Oliver Eberle G. Montavon Klaus-Robert Müller Lior Wolf FAtt 15 89 0 15 Feb 2022
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text Sebastian Gehrmann Elizabeth Clark Thibault Sellam ELM AI4CE 69 184 0 14 Feb 2022
Counterfactual Multi-Token Fairness in Text Classification P. Lohia 21 3 0 08 Feb 2022
Red Teaming Language Models with Language Models Ethan Perez Saffron Huang Francis Song Trevor Cai Roman Ring John Aslanides Amelia Glaese Nat McAleese G. Irving AAML 13 610 0 07 Feb 2022
Measuring and Reducing Model Update Regression in Structured Prediction for NLP Deng Cai Elman Mansimov Yi-An Lai Yixuan Su Lei Shu Yi Zhang KELM 67 8 0 07 Feb 2022
Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities Xin Du Bénédicte Legastelois B. Ganesh A. Rajan Hana Chockler Vaishak Belle Stuart Anderson S. Ramamoorthy AAML 27 6 0 27 Jan 2022
Uncovering More Shallow Heuristics: Probing the Natural Language Inference Capacities of Transformer-Based Pre-Trained Language Models Using Syllogistic Patterns Reto Gubelmann Siegfried Handschuh ReLM LRM 38 6 0 19 Jan 2022
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation Alisa Liu Swabha Swayamdipta Noah A. Smith Yejin Choi 82 212 0 16 Jan 2022
Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions Marwan Omar Soohyeon Choi Daehun Nyang David A. Mohaisen 32 57 0 03 Jan 2022
On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations Aamir Miyajiwala Arnav Ladkat Samiksha Jagadale Raviraj Joshi AAML 17 7 0 02 Jan 2022
Pretty Princess vs. Successful Leader: Gender Roles in Greeting Card Messages Jiao Sun Tongshuang Wu Yue Jiang Ronil Awalegaonkar Xi Lin Diyi Yang 13 8 0 28 Dec 2021
An Interdisciplinary Approach for the Automated Detection and Visualization of Media Bias in News Articles Timo Spinde 30 13 0 26 Dec 2021
More Than Words: Towards Better Quality Interpretations of Text Classifiers Muhammad Bilal Zafar Philipp Schmidt Michele Donini Cédric Archambeau F. Biessmann Sanjiv Ranjan Das K. Kenthapadi FAtt 19 5 0 23 Dec 2021
Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction Dongfang Li Baotian Hu Qingcai Chen Tujie Xu Jingcong Tao Yunan Zhang 32 12 0 20 Dec 2021
Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability Kyle Richardson Ashish Sabharwal ReLM LRM 30 24 0 16 Dec 2021
DuQM: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models Hongyu Zhu Yan Chen Jing Yang Jing Liu Yu Hong Ying-Cong Chen Hua Wu Haifeng Wang AAML 25 6 0 16 Dec 2021
Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics Hyundong Justin Cho Chinnadhurai Sankar Christopher Lin Kaushik Ram Sadagopan Shahin Shayandeh Asli Celikyilmaz Jonathan May Ahmad Beirami 60 10 0 15 Dec 2021
Measure and Improve Robustness in NLP Models: A Survey Xuezhi Wang Haohan Wang Diyi Yang 139 130 0 15 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena Letitia Parcalabescu Michele Cafagna Lilitta Muradjan Anette Frank Iacer Calixto Albert Gatt CoGe 29 109 0 14 Dec 2021
The King is Naked: on the Notion of Robustness for Natural Language Processing Emanuele La Malfa Marta Z. Kwiatkowska 20 28 0 13 Dec 2021
Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation Raymond Li Wen Xiao Linzi Xing Lanjun Wang Gabriel Murray Giuseppe Carenini ViT 27 7 0 10 Dec 2021
Thinking Beyond Distributions in Testing Machine Learned Models Negar Rostamzadeh B. Hutchinson Christina Greer Vinodkumar Prabhakaran TTA 40 6 0 06 Dec 2021
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation Kaustubh D. Dhole Varun Gangal Sebastian Gehrmann Aadesh Gupta Zhenhao Li ... Tianbao Xie Usama Yaseen Michael A. Yee Jing Zhang Yue Zhang 174 86 0 06 Dec 2021
Toward a Taxonomy of Trust for Probabilistic Machine Learning Tamara Broderick Andrew Gelman Rachael Meager Anna L. Smith Tian Zheng 34 9 0 05 Dec 2021
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI Ishan Tarunesh Somak Aditya Monojit Choudhury ELM LRM 31 4 0 04 Dec 2021
True or False: Does the Deep Learning Model Learn to Detect Rumors? Shiwen Ni Jiawen Li Hung-Yu Kao 16 3 0 01 Dec 2021
What Do You See in this Patient? Behavioral Testing of Clinical NLP Models Betty van Aken S. Herrmann Alexander Löser 26 11 0 30 Nov 2021
AI and the Everything in the Whole Wide World Benchmark Inioluwa Deborah Raji Emily M. Bender Amandalynne Paullada Emily L. Denton A. Hanna 30 291 0 26 Nov 2021
True Few-Shot Learning with Prompts -- A Real-World Perspective Timo Schick Hinrich Schütze VLM 27 64 0 26 Nov 2021
Network representation learning: A macro and micro view Xueyi Liu Jie Tang GNN AI4TS 19 23 0 21 Nov 2021
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning Keng Ji Chow Samson Tan MingSung Kan LRM 26 4 0 21 Nov 2021
Beyond NDCG: behavioral testing of recommender systems with RecList P. Chia Jacopo Tagliabue Federico Bianchi Chloe He Brian Ko 27 27 0 18 Nov 2021
How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task Urja Khurana Eric T. Nalisnick Antske Fokkens MoMe 35 6 0 18 Nov 2021
Interpreting Language Models Through Knowledge Graph Extraction Vinitra Swamy Angelika Romanou Martin Jaggi 30 20 0 16 Nov 2021
STAMP 4 NLP -- An Agile Framework for Rapid Quality-Driven NLP Applications Development Philipp Kohl Oliver Schmidts Lars Klöser H. Werth Bodo Kraft Albert Zündorf VLM 19 1 0 16 Nov 2021
Identification of Fine-Grained Location Mentions in Crisis Tweets Sarthak Khanal Maria Traskowsky Doina Caragea 16 4 0 11 Nov 2021
NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation David Alfonso-Hermelo Ahmad Rashid Abbas Ghaddar Huawei Noah’s Mehdi Rezagholizadeh 37 2 0 09 Nov 2021
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models Wei Ping Chejian Xu Shuohang Wang Zhe Gan Yu Cheng Jianfeng Gao Ahmed Hassan Awadallah Bohao Li VLM ELM AAML 33 215 0 04 Nov 2021