Beyond Accuracy: Behavioral Testing of NLP models with CheckList

8 May 2020

Tongshuang Wu

Papers citing "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"

50 / 664 papers shown

Evaluating Out-of-Distribution Performance on Document Image Classifiers Stefan Larson Gordon Lim Yutong Ai David Kuang Kevin Leach OODD OOD 37 18 0 14 Oct 2022
Predicting Fine-Tuning Performance with Probing Zining Zhu Soroosh Shahtalebi Frank Rudzicz 30 9 0 13 Oct 2022
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models Jimin Sun Patrick Fernandes Xinyi Wang Graham Neubig 35 9 0 13 Oct 2022
Benchmarking Long-tail Generalization with Likelihood Splits Ameya Godbole Robin Jia ALM 32 9 0 13 Oct 2022
SEAL: Interactive Tool for Systematic Error Analysis and Labeling Nazneen Rajani Weixin Liang Lingjiao Chen Margaret Mitchell James Zou 48 16 0 11 Oct 2022
Checks and Strategies for Enabling Code-Switched Machine Translation Thamme Gowda Mozhdeh Gheini Jonathan May 30 3 0 11 Oct 2022
REV: Information-Theoretic Evaluation of Free-Text Rationales Hanjie Chen Faeze Brahman Xiang Ren Yangfeng Ji Yejin Choi Swabha Swayamdipta 92 23 0 10 Oct 2022
Montague semantics and modifier consistency measurement in neural language models Danilo S. Carvalho Edoardo Manino Julia Rozanova Lucas C. Cordeiro André Freitas 24 0 0 10 Oct 2022
CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation Tanay Dixit Bhargavi Paranjape Hannaneh Hajishirzi Luke Zettlemoyer SyDa 146 24 0 10 Oct 2022
Quantifying Social Biases Using Templates is Unreliable P. Seshadri Pouya Pezeshkpour Sameer Singh 51 33 0 09 Oct 2022
Artificial Intelligence and Natural Language Processing and Understanding in Space: A Methodological Framework and Four ESA Case Studies José Manuel Gómez-Pérez Andrés García-Silva R. Leone M. Albani Moritz Fontaine C. Poncet L. Summerer A. Donati Ilaria Roma Stefano Scaglioni 18 1 0 07 Oct 2022
Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Recommendation Systems Parikshit Bansal Yashoteja Prabhu Emre Kıcıman Amit Sharma CML OOD 33 0 0 07 Oct 2022
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation Thinh Hung Truong Yulia Otmakhova Tim Baldwin Trevor Cohn Jey Han Lau Karin Verspoor 65 21 0 06 Oct 2022
InferES: A Natural Language Inference Corpus for Spanish Featuring Negation-Based Contrastive and Adversarial Examples Venelin Kovatchev Mariona Taulé 33 4 0 06 Oct 2022
State-of-the-art generalisation research in NLP: A taxonomy and review Dieuwke Hupkes Mario Giulianelli Verna Dankers Mikel Artetxe Yanai Elazar ... Leila Khalatbari Maria Ryskina Rita Frieske Ryan Cotterell Zhijing Jin 129 95 0 06 Oct 2022
Are Synonym Substitution Attacks Really Synonym Substitution Attacks? Cheng-Han Chiang Hunghuei Lee AAML 33 5 0 06 Oct 2022
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes Ke Shen Mayank Kejriwal 27 4 0 03 Oct 2022
Unpacking Large Language Models with Conceptual Consistency Pritish Sahu Michael Cogswell Yunye Gong Ajay Divakaran LRM 87 16 0 29 Sep 2022
Neural Media Bias Detection Using Distant Supervision With BABE -- Bias Annotations By Experts Timo Spinde Manuel Plank Jan-David Krieger Terry Ruas Bela Gipp Akiko Aizawa 27 68 0 29 Sep 2022
An Interdisciplinary Perspective on Evaluation and Experimental Design for Visual Text Analytics: Position Paper Kostiantyn Kucher N. Sultanum Angel Daza Vasiliki Simaki Maria Skeppstedt Barbara Plank Jean-Daniel Fekete Narges Mahyar 25 4 0 23 Sep 2022
Automatic Error Analysis for Document-level Information Extraction Aliva Das Xinya Du Barry Wang Kejian Shi J. Gu Thomas Porter Claire Cardie 26 10 0 15 Sep 2022
The Role of Explanatory Value in Natural Language Processing Kees van Deemter XAI 18 0 0 13 Sep 2022
On Faithfulness and Coherence of Language Explanations for Recommendation Systems Zhouhang Xie Julian McAuley Bodhisattwa Prasad Majumder LRM 35 1 0 12 Sep 2022
DECK: Behavioral Tests to Improve Interpretability and Generalizability of BERT Models Detecting Depression from Text Jekaterina Novikova Ksenia Shkaruta AI4MH 35 4 0 12 Sep 2022
Increasing Adverse Drug Events extraction robustness on social media: case study on negation and speculation Simone Scaboro Beatrice Portelli Emmanuele Chersoni Enrico Santus G. Serra 32 5 0 06 Sep 2022
A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension Xanh Ho Johannes Mario Meissner Saku Sugawara Akiko Aizawa OffRL 35 4 0 05 Sep 2022
Generating Intermediate Steps for NLI with Next-Step Supervision Deepanway Ghosal Somak Aditya Monojit Choudhury LRM 35 1 0 31 Aug 2022
Shortcut Learning of Large Language Models in Natural Language Understanding Mengnan Du Fengxiang He Na Zou Dacheng Tao Xia Hu KELM OffRL 42 84 0 25 Aug 2022
PSSAT: A Perturbed Semantic Structure Awareness Transferring Method for Perturbation-Robust Slot Filling Guanting Dong Daichi Guo Liwen Wang Xuefeng Li Zechen Wang ... Hao Lei Xinyue Cui Yi Huang Junlan Feng Weiran Xu 21 12 0 24 Aug 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Deep Ganguli Liane Lovitt John Kernion Amanda Askell Yuntao Bai ... Nicholas Joseph Sam McCandlish C. Olah Jared Kaplan Jack Clark 231 447 0 23 Aug 2022
KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models Haris Widjaja Kiril Gashteovski Wiem Ben-Rim Pengfei Liu Christopher Malon Daniel Ruffinelli Carolin (Haas) Lawrence Graham Neubig 25 5 0 23 Aug 2022
UKP-SQuARE v2: Explainability and Adversarial Attacks for Trustworthy QA Rachneet Sachdeva Haritz Puerto Tim Baumgärtner Sewin Tariverdian Hao Zhang Kexin Wang H. Saad Leonardo F. R. Ribeiro Iryna Gurevych AAML 18 2 0 19 Aug 2022
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning Olivia Wiles Isabela Albuquerque Sven Gowal VLM 43 47 0 18 Aug 2022
MENLI: Robust Evaluation Metrics from Natural Language Inference Yanran Chen Steffen Eger 32 16 0 15 Aug 2022
Patching open-vocabulary models by interpolating weights Gabriel Ilharco Mitchell Wortsman S. Gadre Shuran Song Hannaneh Hajishirzi Simon Kornblith Ali Farhadi Ludwig Schmidt VLM KELM 32 167 0 10 Aug 2022
Generating Coherent Narratives by Learning Dynamic and Discrete Entity States with a Contrastive Framework Jian Guan Zhenyu Yang Rongsheng Zhang Zhipeng Hu Minlie Huang 26 9 0 08 Aug 2022
A Holistic Approach to Undesired Content Detection in the Real World Todor Markov Chong Zhang Sandhini Agarwal Tyna Eloundou Teddy Lee Steven Adler Angela Jiang L. Weng 22 211 0 05 Aug 2022
ACE: Adaptive Constraint-aware Early Stopping in Hyperparameter Optimization Yi-Wei Chen Chi Wang A. Saied Rui Zhuang 19 2 0 04 Aug 2022
Unit Testing for Concepts in Neural Networks Charles Lovering Ellie Pavlick 25 28 0 28 Jul 2022
An Interpretability Evaluation Benchmark for Pre-trained Language Models Ya-Ming Shen Lijie Wang Ying-Cong Chen Xinyan Xiao Jing Liu Hua Wu 37 4 0 28 Jul 2022
A Survey of Intent Classification and Slot-Filling Datasets for Task-Oriented Dialog Stefan Larson Kevin Leach 41 20 0 26 Jul 2022
Human-Centric Research for NLP: Towards a Definition and Guiding Questions Bhushan Kotnis Kiril Gashteovski J. Gastinger G. Serra Francesco Alesiani T. Sztyler Ammar Shaker Na Gong Carolin (Haas) Lawrence Zhao Xu 25 9 0 10 Jul 2022
Probing Classifiers are Unreliable for Concept Removal and Detection Abhinav Kumar Chenhao Tan Amit Sharma AAML 34 21 0 08 Jul 2022
The "Collections as ML Data" Checklist for Machine Learning & Cultural Heritage Benjamin Charles Germain Lee VLM 16 7 0 06 Jul 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations Tiancheng Zhao Tianqi Zhang Mingwei Zhu Haozhan Shen Kyusong Lee Xiaopeng Lu Jianwei Yin VLM CoGe MLLM 47 91 0 01 Jul 2022
longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks Venelin Kovatchev Trina Chatterjee Venkata S Govindarajan Jifan Chen Eunsol Choi ... K. Erk Matthew Lease Junyi Jessy Li Yating Wu Kyle Mahowald AAML ELM 19 10 0 29 Jun 2022
Plug and Play Counterfactual Text Generation for Model Robustness Nishtha Madaan Srikanta J. Bedathur Diptikalyan Saha 31 4 0 21 Jun 2022
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models Paul Röttger Haitham Seelawi Debora Nozza Zeerak Talat Bertie Vidgen 30 65 0 20 Jun 2022
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models Maribeth Rauh John F. J. Mellor J. Uesato Po-Sen Huang Johannes Welbl ... Amelia Glaese G. Irving Iason Gabriel William S. Isaac Lisa Anne Hendricks 33 49 0 16 Jun 2022
"Understanding Robustness Lottery": A Geometric Visual Comparative Analysis of Neural Network Pruning Approaches Zhimin Li Shusen Liu Xin Yu Kailkhura Bhavya Jie Cao Diffenderfer James Daniel P. Bremer Valerio Pascucci AAML 29 1 0 16 Jun 2022