2005.04118
Cited By
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Papers citing "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Title
Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
Yugeng Liu
Tianshuo Cong
Zhengyu Zhao
Michael Backes
Yun Shen
Yang Zhang
AAML
41
6
0
15 Aug 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers
A. Luccioni
53
19
0
14 Aug 2023
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen
Zhenpeng Chen
Michael Backes
Yun Shen
Yang Zhang
SILM
40
249
0
07 Aug 2023
Explaining Relation Classification Models with Semantic Extents
Lars Klöser
André Büsgen
Philipp Kohl
Bodo Kraft
Albert Zündorf
19
0
0
04 Aug 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger
Hannah Rose Kirk
Bertie Vidgen
Giuseppe Attanasio
Federico Bianchi
Dirk Hovy
ALM
ELM
AILaw
27
127
0
02 Aug 2023
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
Ari Holtzman
Peter West
Luke Zettlemoyer
AI4CE
32
14
0
31 Jul 2023
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks
Xinyu Zhang
Hanbin Hong
Yuan Hong
Peng Huang
Binghui Wang
Zhongjie Ba
Kui Ren
SILM
42
18
0
31 Jul 2023
The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems
Andreas Liesenfeld
Alianda Lopez
Mark Dingemanse
26
8
0
28 Jul 2023
HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
Jiangrui Zheng
Xueqing Liu
Guanqun Yang
Mirazul Haque
Xing Qian
Ravishka Rathnasuriya
Wei Yang
G. Budhrani
45
3
0
23 Jul 2023
LLM Cognitive Judgements Differ From Human
Sotiris Lamprinidis
ELM
32
9
0
20 Jul 2023
Behavioral Analysis of Vision-and-Language Navigation Agents
Zijiao Yang
Arjun Majumdar
Stefan Lee
LM&Ro
LLMAG
19
9
0
20 Jul 2023
Instruction-following Evaluation through Verbalizer Manipulation
Shiyang Li
Jun Yan
Hai Wang
Zheng Tang
Xiang Ren
Vijay Srinivasan
Hongxia Jin
36
25
0
20 Jul 2023
MGit: A Model Versioning and Management System
Wei Hao
Daniel Mendoza
Rafael Ferreira da Silva
Deepak Narayanan
Amar Phanishayee
VLM
27
1
0
14 Jul 2023
How Different Is Stereotypical Bias Across Languages?
Ibrahim Tolga Ozturk
R. Nedelchev
C. Heumann
Esteban Garces Arias
Marius Roger
Bernd Bischl
Matthias Aßenmacher
28
2
0
14 Jul 2023
Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features
Ester Hlavnova
Sebastian Ruder
35
5
0
11 Jul 2023
A Survey on Evaluation of Large Language Models
Yu-Chu Chang
Xu Wang
Jindong Wang
Yuanyi Wu
Linyi Yang
...
Yue Zhang
Yi-Ju Chang
Philip S. Yu
Qian Yang
Xingxu Xie
ELM
LM&MA
ALM
75
1,517
0
06 Jul 2023
SpaceNLI: Evaluating the Consistency of Predicting Inferences in Space
Lasha Abzianidze
J. Zwarts
Yoad Winter
24
2
0
05 Jul 2023
Concept-Based Explanations to Test for False Causal Relationships Learned by Abusive Language Classifiers
I. Nejadgholi
S. Kiritchenko
Kathleen C. Fraser
Esma Balkir
26
0
0
04 Jul 2023
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
Neel Jain
Khalid Saifullah
Yuxin Wen
John Kirchenbauer
Manli Shu
Aniruddha Saha
Micah Goldblum
Jonas Geiping
Tom Goldstein
ALM
ELM
30
23
0
23 Jun 2023
Towards Explainable Evaluation Metrics for Machine Translation
Christoph Leiter
Piyawat Lertvittayakumjorn
M. Fomicheva
Wei-Ye Zhao
Yang Gao
Steffen Eger
ELM
30
13
0
22 Jun 2023
Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning
Shivaen Ramshetty
Gaurav Verma
Srijan Kumar
33
2
0
19 Jun 2023
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri
Daniel Paleka
Florian Tramèr
ELM
50
42
0
16 Jun 2023
SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and its Evaluation
Md. Ekramul Islam
Labib Chowdhury
Faisal Ahamed Khan
Shazzad Hossain
Sourave Hossain
Mohammad Mamun Or Rashid
Nabeel Mohammed
M. R. Amin
14
11
0
09 Jun 2023
PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Kaijie Zhu
Jindong Wang
Jiaheng Zhou
Zichen Wang
Hao Chen
...
Linyi Yang
Weirong Ye
Yue Zhang
Neil Zhenqiang Gong
Xingxu Xie
SILM
41
144
0
07 Jun 2023
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions
John Joon Young Chung
Ece Kamar
Saleema Amershi
ALM
34
109
0
07 Jun 2023
MISGENDERED: Limits of Large Language Models in Understanding Pronouns
Tamanna Hossain
Sunipa Dev
Sameer Singh
AILaw
35
34
0
06 Jun 2023
Utterance Classification with Logical Neural Network: Explainable AI for Mental Disorder Diagnosis
Yeldar Toleubay
Don Joven Agravante
Daiki Kimura
Baihan Lin
Djallel Bouneffouf
Michiaki Tatsubori
18
4
0
06 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip Torr
Volker Tresp
VPVLM
VLM
34
17
0
03 Jun 2023
Multilingual Conceptual Coverage in Text-to-Image Models
Michael Stephen Saxon
William Yang Wang
EGVM
31
8
0
02 Jun 2023
VoteTRANS: Detecting Adversarial Text without Training by Voting on Hard Labels of Transformations
Hoang-Quoc Nguyen-Son
Seira Hidano
Kazuhide Fukushima
S. Kiyomoto
Isao Echizen
28
0
0
02 Jun 2023
UKP-SQuARE: An Interactive Tool for Teaching Question Answering
Haishuo Fang
Haritz Puerto
Iryna Gurevych
26
1
0
31 May 2023
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Yangyi Chen
Hongcheng Gao
Ganqu Cui
Lifan Yuan
Dehan Kong
...
Longtao Huang
H. Xue
Zhiyuan Liu
Maosong Sun
Heng Ji
AAML
ELM
27
6
0
29 May 2023
Targeted Data Generation: Finding and Fixing Model Weaknesses
Zexue He
Marco Tulio Ribeiro
Fereshte Khani
29
13
0
28 May 2023
FERMAT: An Alternative to Accuracy for Numerical Reasoning
Jasivan Sivakumar
N. Moosavi
ReLM
LRM
40
3
0
27 May 2023
Query-Efficient Black-Box Red Teaming via Bayesian Optimization
Deokjae Lee
JunYeong Lee
Jung-Woo Ha
Jin-Hwa Kim
Sang-Woo Lee
Hwaran Lee
Hyun Oh Song
AAML
24
23
0
27 May 2023
Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning
Ruixiang Tang
Dehan Kong
Lo-li Huang
Hui Xue
34
50
0
26 May 2023
CREST: A Joint Framework for Rationalization and Counterfactual Text Generation
Marcos Vinícius Treviso
Alexis Ross
Nuno M. Guerreiro
André F.T. Martins
29
16
0
26 May 2023
Controlling Learned Effects to Reduce Spurious Correlations in Text Classifiers
Parikshit Bansal
Amit Sharma
CML
26
5
0
26 May 2023
Not wacky vs. definitely wacky: A study of scalar adverbs in pretrained language models
Isabelle Lorge
J. Pierrehumbert
41
0
0
25 May 2023
On Degrees of Freedom in Defining and Testing Natural Language Understanding
Saku Sugawara
S. Tsugita
ELM
34
1
0
24 May 2023
MuLER: Detailed and Scalable Reference-based Evaluation
Taelin Karidi
Leshem Choshen
Gal Patel
Omri Abend
40
0
0
24 May 2023
Adversarial Demonstration Attacks on Large Language Models
Jiong Wang
Zi-yang Liu
Keun Hee Park
Zhuojun Jiang
Zhaoheng Zheng
Zhuofeng Wu
Muhao Chen
Chaowei Xiao
SILM
30
52
0
24 May 2023
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
Victoria Basmov
Yoav Goldberg
Reut Tsarfaty
ReLM
LRM
32
5
0
24 May 2023
Debiasing should be Good and Bad: Measuring the Consistency of Debiasing Techniques in Language Models
Robert D Morabito
Jad Kabbara
Ali Emami
19
6
0
23 May 2023
Out-of-Distribution Generalization in Text Classification: Past, Present, and Future
Linyi Yang
Yangqiu Song
Xuan Ren
Chenyang Lyu
Yidong Wang
Lingqiao Liu
Jindong Wang
Jennifer Foster
Yue Zhang
OOD
37
2
0
23 May 2023
Validating Multimedia Content Moderation Software via Semantic Fusion
Wenxuan Wang
Jingyuan Huang
Chang Chen
Jiazhen Gu
Jianping Zhang
Weibin Wu
Pinjia He
Michael Lyu
75
9
0
23 May 2023
Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals
Ananth Balashankar
Xuezhi Wang
Yao Qin
Ben Packer
Nithum Thain
Jilin Chen
Ed H. Chi
Alex Beutel
25
0
0
22 May 2023
Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection
Rheeya Uppaal
Junjie Hu
Yixuan Li
OODD
119
33
0
22 May 2023
Evaluating ChatGPT's Performance for Multilingual and Emoji-based Hate Speech Detection
Mithun Das
Saurabh Kumar Pandey
Animesh Mukherjee
51
10
0
22 May 2023
Cross-functional Analysis of Generalisation in Behavioural Learning
Pedro Henrique Luz de Araujo
Benjamin Roth
29
3
0
22 May 2023