Beyond Accuracy: Behavioral Testing of NLP models with CheckList

8 May 2020

Tongshuang Wu

Papers citing "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"

50 / 664 papers shown

Data Synthesis for Testing Black-Box Machine Learning Models Diptikalyan Saha Aniya Aggarwal Sandeep Hans 22 4 0 03 Nov 2021
Template Filling for Controllable Commonsense Reasoning Dheeraj Rajagopal Vivek Khetan Bogdan Sacaleanu A. Gershman Andy E. Fano Eduard H. Hovy BDL LRM 25 6 0 31 Oct 2021
DAG Card is the new Model Card Jacopo Tagliabue Ville Tuulos C. Greco Valay Dave SyDa 39 11 0 24 Oct 2021
Behavioral Experiments for Understanding Catastrophic Forgetting Samuel J. Bell Neil D. Lawrence 35 4 0 20 Oct 2021
AequeVox: Automated Fairness Testing of Speech Recognition Systems Sai Sathiesh Rajan Sakshi Udeshi Sudipta Chattopadhyay 28 15 0 19 Oct 2021
Label-Descriptive Patterns and Their Application to Characterizing Classification Errors Michael A. Hedderich Jonas Fischer Dietrich Klakow Jilles Vreeken 6 10 0 18 Oct 2021
Predicting the Performance of Multilingual NLP Models A. Srinivasan Sunayana Sitaram T. Ganu Sandipan Dandapat Kalika Bali Monojit Choudhury LRM 32 27 0 17 Oct 2021
On the Robustness of Reading Comprehension Models to Entity Renaming Jun Yan Yang Xiao Sagnik Mukherjee Bill Yuchen Lin Robin Jia Xiang Ren 16 20 0 16 Oct 2021
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail Sam Bowman OffRL 24 45 0 15 Oct 2021
Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models Tianlu Wang Rohit Sridhar Diyi Yang Xuezhi Wang AAML 120 72 0 14 Oct 2021
Retrieval-guided Counterfactual Generation for QA Bhargavi Paranjape Matthew Lamm Ian Tenney 33 31 0 14 Oct 2021
The Irrationality of Neural Rationale Models Yiming Zheng Serena Booth J. Shah Yilun Zhou 35 16 0 14 Oct 2021
PARE: A Simple and Strong Baseline for Monolingual and Multilingual Distantly Supervised Relation Extraction Vipul Rathore Kartikeya Badola Mausam Parag Singla 41 19 0 14 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference Tejas Gokhale A. Chaudhary Pratyay Banerjee Chitta Baral Yezhou Yang 54 17 0 14 Oct 2021
Interpreting the Robustness of Neural NLP Models to Textual Perturbations Yunxiang Zhang Liangming Pan Samson Tan Min-Yen Kan 33 21 0 14 Oct 2021
AutoNLU: Detecting, root-causing, and fixing NLU model errors P. Sethi Denis Savenkov Forough Arabshahi Jack Goetz Micaela Tolliver Nicolas Scheffer I. Kabul Yue Liu Ahmed Aly 18 4 0 12 Oct 2021
Salient ImageNet: How to discover spurious features in Deep Learning? Sahil Singla S. Feizi AAML VLM 29 115 0 08 Oct 2021
Automated Testing of AI Models Swagatam Haldar Deepak Vijaykeerthy Diptikalyan Saha VLM 21 0 0 07 Oct 2021
GNN is a Counter? Revisiting GNN for Question Answering Kuan-Chieh Jackson Wang Yuyu Zhang Diyi Yang Le Song Tao Qin LMTD 29 30 0 07 Oct 2021
Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development Aspen K. Hopkins Serena Booth 29 45 0 06 Oct 2021
Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance Karthikeyan K Aalok Sathe Somak Aditya Monojit Choudhury LRM 33 10 0 05 Oct 2021
Trustworthy AI: From Principles to Practices Bo-wen Li Peng Qi Bo Liu Shuai Di Jingen Liu Jiquan Pei Jinfeng Yi Bowen Zhou 119 356 0 04 Oct 2021
Human-Centered AI for Data Science: A Systematic Approach Dakuo Wang Xiaojuan Ma A. Wang 17 3 0 03 Oct 2021
Enhancing Model Robustness and Fairness with Causality: A Regularization Approach Zhao Wang Kai Shu A. Culotta OOD 21 14 0 03 Oct 2021
Language Invariant Properties in Natural Language Processing Federico Bianchi Debora Nozza Dirk Hovy 55 3 0 27 Sep 2021
RuleBert: Teaching Soft Rules to Pre-trained Language Models Mohammed Saeed N. Ahmadi Preslav Nakov Paolo Papotti LRM 253 31 0 24 Sep 2021
Separating Retention from Extraction in the Evaluation of End-to-end Relation Extraction Bruno Taillé Vincent Guigue Geoffrey Scoutheeten Patrick Gallinari 79 5 0 24 Sep 2021
Robust Generalization of Quadratic Neural Networks via Function Identification Kan Xu Hamsa Bastani Osbert Bastani OOD 34 8 0 22 Sep 2021
Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation Diptesh Kanojia M. Fomicheva Tharindu Ranasinghe Frédéric Blain Constantin Orăsan Lucia Specia 18 10 0 22 Sep 2021
NADE: A Benchmark for Robust Adverse Drug Events Extraction in Face of Negations Simone Scaboro Beatrice Portelli Emmanuele Chersoni Enrico Santus G. Serra 25 9 0 21 Sep 2021
Types of Out-of-Distribution Texts and How to Detect Them Udit Arora William Huang He He OODD 225 97 0 14 Sep 2021
Tribrid: Stance Classification with Neural Inconsistency Detection Song Yang Jacopo Urbani 14 6 0 14 Sep 2021
SituatedQA: Incorporating Extra-Linguistic Contexts into QA Michael J.Q. Zhang Eunsol Choi RALM 32 136 0 13 Sep 2021
Perturbation CheckLists for Evaluating NLG Evaluation Metrics Ananya B. Sai Tanay Dixit D. Y. Sheth S. Mohan Mitesh M. Khapra AAML 116 57 0 13 Sep 2021
Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers Shane Storks J. Chai 51 5 0 10 Sep 2021
An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model Kijong Han Seojin Lee Wooin Lee Joosung Lee Donghun Lee AAML 25 5 0 10 Sep 2021
AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction Dong-Ho Lee Ravi Kiran Selvam Sheikh Muhammad Sarwar Bill Yuchen Lin Fred Morstatter Jay Pujara Elizabeth Boschee James Allan Xiang Ren 31 2 0 10 Sep 2021
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond Amir Feder Katherine A. Keith Emaad A. Manzoor Reid Pryzant Dhanya Sridhar ... Roi Reichart Margaret E. Roberts Brandon M Stewart Victor Veitch Diyi Yang CML 41 234 0 02 Sep 2021
DuTrust: A Sentiment Analysis Dataset for Trustworthiness Evaluation Lijie Wang Hao Liu Shu-ping Peng Hongxuan Tang Xinyan Xiao Ying-Cong Chen Hua Wu Haifeng Wang 25 5 0 30 Aug 2021
LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation Jian Guan Zhuoer Feng Yamei Chen Ru He Xiaoxi Mao Changjie Fan Minlie Huang 39 32 0 30 Aug 2021
HeadlineCause: A Dataset of News Headlines for Detecting Causalities I. Gusev Alexey Tikhonov CML 14 7 0 28 Aug 2021
Deep learning models are not robust against noise in clinical text M. Moradi Kathrin Blagec Matthias Samwald OOD 25 6 0 27 Aug 2021
Evaluating the Robustness of Neural Language Models to Input Perturbations M. Moradi Matthias Samwald AAML 48 95 0 27 Aug 2021
DoWhy: Addressing Challenges in Expressing and Validating Causal Assumptions Amit Sharma Vasilis Syrgkanis Cheng Zhang Emre Kıcıman 24 26 0 27 Aug 2021
ComSum: Commit Messages Summarization and Meaning Preservation Leshem Choshen Idan Amit 17 4 0 23 Aug 2021
Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models Myeongjun Jang D. Kwon Thomas Lukasiewicz 38 13 0 15 Aug 2021
Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems Laurel J. Orr Atindriyo Sanyal Xiao Ling Karan Goel Megan Leszczynski 25 18 0 11 Aug 2021
Using Metamorphic Relations to Verify and Enhance Artcode Classification Liming Xu Dave Towey Andrew P. French Steve Benford Z. Zhou T. Chen 19 8 0 05 Aug 2021
Underreporting of errors in NLG output, and what to do about it Emiel van Miltenburg Miruna Clinciu Ondrej Dusek Dimitra Gkatzia Stephanie Inglis ... Saad Mahamood Emma Manning S. Schoch Craig Thomson Luou Wen 27 38 0 02 Aug 2021
TabPert: An Effective Platform for Tabular Perturbation Nupur Jain Vivek Gupta Anshul Rai G. Kumar LMTD 14 5 0 02 Aug 2021