Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2005.04118
Cited By
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Title
Learning Repetition-Invariant Representations for Polymer Informatics
Yihan Zhu
Gang Liu
Eric Inae
Tengfei Luo
Meng Jiang
17
0
0
15 May 2025
IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method
Mihyeon Kim
Juhyoung Park
Youngbin Kim
34
0
0
11 May 2025
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Leon Eshuijs
Shihan Wang
Antske Fokkens
26
0
0
09 May 2025
PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents
Takyoung Kim
Janvijay Singh
Shuhaib Mehri
Emre Can Acikgoz
Sagnik Mukherjee
Nimet Beyza Bozdag
Sumuk Shashidhar
Gokhan Tur
Dilek Hakkani-Tur
LLMAG
29
0
0
02 May 2025
Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets
Masumi Morishige
Ryo Koshihara
ALM
14
0
0
02 May 2025
SAGE
\texttt{SAGE}
SAGE
: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
47
0
0
28 Apr 2025
Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning
Teeradaj Racharak
Chaiyong Ragkhitwetsagul
Chommakorn Sontesadisai
Thanwadee Sunetnanta
40
0
0
26 Apr 2025
Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs
Mohammad Akbar-Tajari
Mohammad Taher Pilehvar
Mohammad Mahmoody
AAML
48
0
0
26 Apr 2025
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova
Hung Thinh Truong
Rahmad Mahendra
Zenan Zhai
Rongxin Zhu
Daniel Beck
Jey Han Lau
ELM
70
0
0
24 Apr 2025
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich
Anyi Wang
Raoyuan Zhao
Florian Eichin
Barbara Plank
35
0
0
22 Apr 2025
aiXamine: Simplified LLM Safety and Security
Fatih Deniz
Dorde Popovic
Yazan Boshmaf
Euisuh Jeong
M. Ahmad
Sanjay Chawla
Issa M. Khalil
ELM
80
0
0
21 Apr 2025
Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations
Yiyou Sun
Y. Gai
Lijie Chen
Abhilasha Ravichander
Yejin Choi
D. Song
HILM
57
0
0
17 Apr 2025
The Code Barrier: What LLMs Actually Understand?
Serge Lionel Nikiema
Jordan Samhi
A. Kaboré
Jacques Klein
Tegawende F. Bissyande
ELM
29
1
0
14 Apr 2025
Cognitive Debiasing Large Language Models for Decision-Making
Yougang Lyu
Shijie Ren
Yue Feng
Zihan Wang
Z. Chen
Z. Z. Ren
Maarten de Rijke
41
0
0
05 Apr 2025
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
Aryan Agrawal
Lisa Alazraki
Shahin Honarvar
Marek Rei
57
0
0
03 Apr 2025
Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach
Hongliu Cao
67
0
0
01 Apr 2025
Pay More Attention to the Robustness of Prompt for Instruction Data Mining
Qiang Wang
Dawei Feng
Xu Zhang
Ao Shen
Yang Xu
Bo Ding
H. Wang
AAML
48
0
0
31 Mar 2025
On Explaining (Large) Language Models For Code Using Global Code-Based Explanations
David Nader-Palacio
Dipin Khati
Daniel Rodríguez-Cárdenas
Alejandro Velasco
Denys Poshyvanyk
LRM
47
0
0
21 Mar 2025
Model Risk Management for Generative AI In Financial Institutions
Anwesha Bhattacharyya
Ye Yu
Hanyu Yang
Rahul Singh
Tarun Joshi
Jie Chen
Kiran Yalavarthy
AIFin
MedIm
46
0
0
19 Mar 2025
Prompt Sentiment: The Catalyst for LLM Change
Vishal Gandhi
Sagar Gandhi
52
1
0
14 Mar 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Zhaofeng Wu
Michihiro Yasunaga
Andrew Cohen
Yoon Kim
Asli Celikyilmaz
Marjan Ghazvininejad
46
2
0
14 Mar 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM
ELM
67
4
0
07 Mar 2025
AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models
Hengrui Xing
Cong Tian
L. Zhao
Z. Ma
WenSheng Wang
N. Zhang
Chao Huang
Zhenhua Duan
49
0
0
07 Mar 2025
The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats
William Brach
Kristián Košťál
Michal Ries
197
0
0
04 Mar 2025
Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models
Tabinda Sarwar
Antonio Jose Jimeno Yepes
Lawrence Cavedon
69
0
0
12 Feb 2025
SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation
Saurabh Kumar Pandey
S. Vashistha
Debrup Das
Somak Aditya
Monojit Choudhury
AAML
74
0
0
10 Feb 2025
A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks
Elie Antoine
Frédéric Béchet
Géraldine Damnati
Philippe Langlais
56
1
0
29 Jan 2025
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models
Ran Xu
Hejie Cui
Yue Yu
Xuan Kan
Wenqi Shi
Yuchen Zhuang
Wei Jin
Joyce C. Ho
Carl Yang
69
14
0
28 Jan 2025
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics
Ameya Godbole
Robin Jia
HILM
53
1
0
24 Jan 2025
Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts
Preethi Seshadri
Seraphina Goldfarb-Tarrant
40
1
0
08 Jan 2025
Predictable Artificial Intelligence
Lexin Zhou
Pablo Antonio Moreno Casares
Fernando Martínez-Plumed
John Burden
Ryan Burnell
...
Seán Ó hÉigeartaigh
Danaja Rutar
Wout Schellaert
Konstantinos Voudouris
José Hernández-Orallo
51
2
0
08 Jan 2025
Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets
Vatsal Gupta
Pranshu Pandya
Tushar Kataria
Vivek Gupta
Dan Roth
AAML
57
1
0
03 Jan 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
93
12
0
31 Dec 2024
The Evolution of LLM Adoption in Industry Data Curation Practices
Crystal Qian
Michael Xieyang Liu
Emily Reif
Grady Simon
Nada Hussein
Nathan Clement
James Wexler
Carrie J. Cai
Michael Terry
Minsuk Kahng
AILaw
ELM
77
4
0
20 Dec 2024
Unpacking the Resilience of SNLI Contradiction Examples to Attacks
Chetan Verma
Archit Agarwal
AAML
74
0
0
15 Dec 2024
Human-Centric NLP or AI-Centric Illusion?: A Critical Investigation
Piyapath T Spencer
80
0
0
14 Dec 2024
Neural Text Normalization for Luxembourgish using Real-Life Variation Data
Anne-Marie Lutgen
Alistair Plum
Christoph Purschke
Barbara Plank
72
1
0
12 Dec 2024
Improving Object Detection by Modifying Synthetic Data with Explainable AI
Nitish Mital
Simon Malzard
Richard Walters
Celso M. De Melo
Raghuveer Rao
Victoria Nockles
80
0
0
02 Dec 2024
Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection
Shanu Kumar
Saish Mendke
Karody Lubna Abdul Rahman
Santosh Kurasa
Parag Agrawal
Sandipan Dandapat
LLMAG
LRM
70
2
0
30 Nov 2024
Interactive Visual Assessment for Text-to-Image Generation Models
Xiaoyue Mi
Fan Tang
Juan Cao
Qiang Sheng
Ziyao Huang
Peng Li
Yi Liu
Tong-Yee Lee
EGVM
71
0
0
23 Nov 2024
The Explabox: Model-Agnostic Machine Learning Transparency & Analysis
Marcel Robeer
Michiel Bron
Elize Herrewijnen
Riwish Hoeseni
Floris Bex
67
0
0
22 Nov 2024
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
Tomas Horych
Christoph Mandl
Terry Ruas
André Greiner-Petter
Bela Gipp
Akiko Aizawa
Timo Spinde
96
4
0
17 Nov 2024
Semi-Truths: A Large-Scale Dataset of AI-Augmented Images for Evaluating Robustness of AI-Generated Image detectors
Anisha Pal
Julia Kruk
Mansi Phute
Manognya Bhattaram
Diyi Yang
Duen Horng Chau
Judy Hoffman
AAML
47
2
0
12 Nov 2024
Orbit: A Framework for Designing and Evaluating Multi-objective Rankers
Chenyang Yang
Tesi Xiao
Michael Shavlovsky
Christian Kastner
Tongshuang Wu
42
0
0
07 Nov 2024
Diversity Helps Jailbreak Large Language Models
Weiliang Zhao
Daniel Ben-Levi
Wei Hao
Junfeng Yang
Chengzhi Mao
AAML
155
0
0
06 Nov 2024
Benchmark Data Repositories for Better Benchmarking
Rachel Longjohn
Markelle Kelly
Sameer Singh
Padhraic Smyth
46
0
0
31 Oct 2024
Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
Ioannis Tsiamas
Matthias Sperber
Andrew Finch
Sarthak Garg
36
0
0
31 Oct 2024
Who
Speaks
Matters
\textit{Who Speaks Matters}
Who Speaks Matters
: Analysing the Influence of the Speaker's Ethnicity on Hate Classification
Ananya Malik
Kartik Sharma
Lynnette Hui Xian Ng
Shaily Bhatt
34
0
0
27 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
33
1
0
17 Oct 2024
Tracking Universal Features Through Fine-Tuning and Model Merging
Niels Horn
Desmond Elliott
MoMe
36
0
0
16 Oct 2024
1
2
3
4
...
12
13
14
Next