Beyond Accuracy: Behavioral Testing of NLP models with CheckList

8 May 2020

Tongshuang Wu

Papers citing "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"

50 / 664 papers shown

Title
Learning Repetition-Invariant Representations for Polymer Informatics Yihan Zhu Gang Liu Eric Inae Tengfei Luo Meng Jiang 17 0 0 15 May 2025
IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method Mihyeon Kim Juhyoung Park Youngbin Kim 34 0 0 11 May 2025
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification Leon Eshuijs Shihan Wang Antske Fokkens 26 0 0 09 May 2025
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents Takyoung Kim Janvijay Singh Shuhaib Mehri Emre Can Acikgoz Sagnik Mukherjee Nimet Beyza Bozdag Sumuk Shashidhar Gokhan Tur Dilek Hakkani-Tur LLMAG 29 0 0 02 May 2025
Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets Masumi Morishige Ryo Koshihara ALM 14 0 0 02 May 2025
SAGE: A Generic Framework for LLM Safety Evaluation Madhur Jindal Hari Shrawgi Parag Agrawal Sandipan Dandapat ELM 47 0 0 28 Apr 2025
Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning Teeradaj Racharak Chaiyong Ragkhitwetsagul Chommakorn Sontesadisai Thanwadee Sunetnanta 40 0 0 26 Apr 2025
Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs Mohammad Akbar-Tajari Mohammad Taher Pilehvar Mohammad Mahmoody AAML 48 0 0 26 Apr 2025
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation Yulia Otmakhova Hung Thinh Truong Rahmad Mahendra Zenan Zhai Rongxin Zhu Daniel Beck Jey Han Lau ELM 70 0 0 24 Apr 2025
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns Michael A. Hedderich Anyi Wang Raoyuan Zhao Florian Eichin Barbara Plank 35 0 0 22 Apr 2025
aiXamine: Simplified LLM Safety and Security Fatih Deniz Dorde Popovic Yazan Boshmaf Euisuh Jeong M. Ahmad Sanjay Chawla Issa M. Khalil ELM 80 0 0 21 Apr 2025
Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations Yiyou Sun Y. Gai Lijie Chen Abhilasha Ravichander Yejin Choi D. Song HILM 57 0 0 17 Apr 2025
The Code Barrier: What LLMs Actually Understand? Serge Lionel Nikiema Jordan Samhi A. Kaboré Jacques Klein Tegawende F. Bissyande ELM 29 1 0 14 Apr 2025
Cognitive Debiasing Large Language Models for Decision-Making Yougang Lyu Shijie Ren Yue Feng Zihan Wang Z. Chen Z. Z. Ren Maarten de Rijke 41 0 0 05 Apr 2025
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study Aryan Agrawal Lisa Alazraki Shahin Honarvar Marek Rei 57 0 0 03 Apr 2025
Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach Hongliu Cao 67 0 0 01 Apr 2025
Pay More Attention to the Robustness of Prompt for Instruction Data Mining Qiang Wang Dawei Feng Xu Zhang Ao Shen Yang Xu Bo Ding H. Wang AAML 48 0 0 31 Mar 2025
On Explaining (Large) Language Models For Code Using Global Code-Based Explanations David Nader-Palacio Dipin Khati Daniel Rodríguez-Cárdenas Alejandro Velasco Denys Poshyvanyk LRM 47 0 0 21 Mar 2025
Model Risk Management for Generative AI In Financial Institutions Anwesha Bhattacharyya Ye Yu Hanyu Yang Rahul Singh Tarun Joshi Jie Chen Kiran Yalavarthy AIFin MedIm 46 0 0 19 Mar 2025
Prompt Sentiment: The Catalyst for LLM Change Vishal Gandhi Sagar Gandhi 52 1 0 14 Mar 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs Zhaofeng Wu Michihiro Yasunaga Andrew Cohen Yoon Kim Asli Celikyilmaz Marjan Ghazvininejad 46 2 0 14 Mar 2025
Toward an Evaluation Science for Generative AI Systems Laura Weidinger Deb Raji Hanna M. Wallach Margaret Mitchell Angelina Wang Olawale Salaudeen Rishi Bommasani Sayash Kapoor Deep Ganguli Sanmi Koyejo EGVM ELM 67 4 0 07 Mar 2025
AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models Hengrui Xing Cong Tian L. Zhao Z. Ma WenSheng Wang N. Zhang Chao Huang Zhenhua Duan 49 0 0 07 Mar 2025
The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats William Brach Kristián Košťál Michal Ries 197 0 0 04 Mar 2025
Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models Tabinda Sarwar Antonio Jose Jimeno Yepes Lawrence Cavedon 69 0 0 12 Feb 2025
SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation Saurabh Kumar Pandey S. Vashistha Debrup Das Somak Aditya Monojit Choudhury AAML 74 0 0 10 Feb 2025
A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks Elie Antoine Frédéric Béchet Géraldine Damnati Philippe Langlais 56 1 0 29 Jan 2025
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models Ran Xu Hejie Cui Yue Yu Xuan Kan Wenqi Shi Yuchen Zhuang Wei Jin Joyce C. Ho Carl Yang 69 14 0 28 Jan 2025
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics Ameya Godbole Robin Jia HILM 53 1 0 24 Jan 2025
Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts Preethi Seshadri Seraphina Goldfarb-Tarrant 40 1 0 08 Jan 2025
Predictable Artificial Intelligence Lexin Zhou Pablo Antonio Moreno Casares Fernando Martínez-Plumed John Burden Ryan Burnell ... Seán Ó hÉigeartaigh Danaja Rutar Wout Schellaert Konstantinos Voudouris José Hernández-Orallo 51 2 0 08 Jan 2025
Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets Vatsal Gupta Pranshu Pandya Tushar Kataria Vivek Gupta Dan Roth AAML 57 1 0 03 Jan 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models Yulei Qin Yuncheng Yang Pengcheng Guo Gang Li Hang Shao Yuchen Shi Zihan Xu Yun Gu Ke Li Xing Sun ALM 93 12 0 31 Dec 2024
The Evolution of LLM Adoption in Industry Data Curation Practices Crystal Qian Michael Xieyang Liu Emily Reif Grady Simon Nada Hussein Nathan Clement James Wexler Carrie J. Cai Michael Terry Minsuk Kahng AILaw ELM 77 4 0 20 Dec 2024
Unpacking the Resilience of SNLI Contradiction Examples to Attacks Chetan Verma Archit Agarwal AAML 74 0 0 15 Dec 2024
Human-Centric NLP or AI-Centric Illusion?: A Critical Investigation Piyapath T Spencer 80 0 0 14 Dec 2024
Neural Text Normalization for Luxembourgish using Real-Life Variation Data Anne-Marie Lutgen Alistair Plum Christoph Purschke Barbara Plank 72 1 0 12 Dec 2024
Improving Object Detection by Modifying Synthetic Data with Explainable AI Nitish Mital Simon Malzard Richard Walters Celso M. De Melo Raghuveer Rao Victoria Nockles 80 0 0 02 Dec 2024
Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection Shanu Kumar Saish Mendke Karody Lubna Abdul Rahman Santosh Kurasa Parag Agrawal Sandipan Dandapat LLMAG LRM 70 2 0 30 Nov 2024
Interactive Visual Assessment for Text-to-Image Generation Models Xiaoyue Mi Fan Tang Juan Cao Qiang Sheng Ziyao Huang Peng Li Yi Liu Tong-Yee Lee EGVM 71 0 0 23 Nov 2024
The Explabox: Model-Agnostic Machine Learning Transparency & Analysis Marcel Robeer Michiel Bron Elize Herrewijnen Riwish Hoeseni Floris Bex 67 0 0 22 Nov 2024
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection Tomas Horych Christoph Mandl Terry Ruas André Greiner-Petter Bela Gipp Akiko Aizawa Timo Spinde 96 4 0 17 Nov 2024
Semi-Truths: A Large-Scale Dataset of AI-Augmented Images for Evaluating Robustness of AI-Generated Image detectors Anisha Pal Julia Kruk Mansi Phute Manognya Bhattaram Diyi Yang Duen Horng Chau Judy Hoffman AAML 47 2 0 12 Nov 2024
Orbit: A Framework for Designing and Evaluating Multi-objective Rankers Chenyang Yang Tesi Xiao Michael Shavlovsky Christian Kastner Tongshuang Wu 42 0 0 07 Nov 2024
Diversity Helps Jailbreak Large Language Models Weiliang Zhao Daniel Ben-Levi Wei Hao Junfeng Yang Chengzhi Mao AAML 155 0 0 06 Nov 2024
Benchmark Data Repositories for Better Benchmarking Rachel Longjohn Markelle Kelly Sameer Singh Padhraic Smyth 46 0 0 31 Oct 2024
Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? Ioannis Tsiamas Matthias Sperber Andrew Finch Sarthak Garg 36 0 0 31 Oct 2024
Who Speaks Matters: Analysing the Influence of the Speaker's Ethnicity on Hate Classification Ananya Malik Kartik Sharma Lynnette Hui Xian Ng Shaily Bhatt 34 0 0 27 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models Eddie L. Ungless Nikolas Vitsakis Zeerak Talat James Garforth Bjorn Ross Arno Onken Atoosa Kasirzadeh Alexandra Birch 33 1 0 17 Oct 2024
Tracking Universal Features Through Fine-Tuning and Model Merging Niels Horn Desmond Elliott MoMe 36 0 0 16 Oct 2024