2005.04118
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
ELM
Papers citing
"Beyond Accuracy: Behavioral Testing of NLP models with CheckList"
50 / 664 papers shown
Title
Step by Step to Fairness: Attributing Societal Bias in Task-oriented Dialogue Systems
Hsuan Su
Rebecca Qian
Chinnadhurai Sankar
Shahin Shayandeh
Shang-Tse Chen
Hung-yi Lee
Daniel M. Bikel
42
0
0
11 Nov 2023
Towards Effective Paraphrasing for Information Disguise
Anmol Agarwal
Shrey Gupta
Vamshi Krishna Bonagiri
Manas Gaur
Joseph M. Reagle
Ponnurangam Kumaraguru
35
3
0
08 Nov 2023
Perturbed examples reveal invariances shared by language models
Ruchit Rawal
Mariya Toneva
AAML
42
0
0
07 Nov 2023
Principles from Clinical Research for NLP Model Generalization
Aparna Elangovan
Jiayuan He
Yuan Li
Karin Verspoor
CML
32
3
0
07 Nov 2023
QualEval: Qualitative Evaluation for Model Improvement
Vishvak Murahari
Ameet Deshpande
Peter Clark
Tanmay Rajpurohit
Ashish Sabharwal
Karthik Narasimhan
Ashwin Kalyan
32
4
0
06 Nov 2023
People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection
Indira Sen
Dennis Assenmacher
Mattia Samory
Isabelle Augenstein
Wil M.P. van der Aalst
Claudia Wagner
17
19
0
02 Nov 2023
Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis
Hongyi Zheng
Abulhair Saparov
AAML
LRM
19
7
0
01 Nov 2023
Sentiment Analysis in Digital Spaces: An Overview of Reviews
L. Ayravainen
Joanne Hinds
Brittany I. Davidson
36
0
0
30 Oct 2023
On General Language Understanding
David Schlangen
40
1
0
27 Oct 2023
Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data
B. V. Breugel
Nabeel Seedat
F. Imrie
M. Schaar
SyDa
26
20
0
25 Oct 2023
Can You Follow Me? Testing Situational Understanding in ChatGPT
Chenghao Yang
Allyson Ettinger
LRM
LLMAG
ELM
112
4
0
24 Oct 2023
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
Sander Schulhoff
Jeremy Pinto
Anaum Khan
Louis-François Bouchard
Chenglei Si
Svetlina Anati
Valen Tagliabue
Anson Liu Kost
Christopher Carnahan
Jordan L. Boyd-Graber
SILM
37
41
0
24 Oct 2023
Linking Surface Facts to Large-Scale Knowledge Graphs
Gorjan Radevski
Kiril Gashteovski
Chia-Chien Hung
Carolin (Haas) Lawrence
Goran Glavaš
HILM
22
3
0
23 Oct 2023
Universal Domain Adaptation for Robust Handling of Distributional Shifts in NLP
Hyuhng Joon Kim
Hyunsoo Cho
Sang-Woo Lee
Junyeob Kim
Choonghyun Park
Sang-goo Lee
Kang Min Yoo
Taeuk Kim
VLM
OOD
48
1
0
23 Oct 2023
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
Zexue He
Yu-Xiang Wang
An Yan
Yao Liu
Eric Y. Chang
Amilcare Gentili
Julian McAuley
Chun-Nan Hsu
ELM
83
14
0
21 Oct 2023
Toward Stronger Textual Attack Detectors
Pierre Colombo
Marine Picot
Nathan Noiry
Guillaume Staerman
Pablo Piantanida
59
5
0
21 Oct 2023
Towards General Error Diagnosis via Behavioral Testing in Machine Translation
Junjie Wu
Lemao Liu
Dit-Yan Yeung
32
2
0
20 Oct 2023
An LLM can Fool Itself: A Prompt-Based Adversarial Attack
Xilie Xu
Keyi Kong
Ning Liu
Li-zhen Cui
Di Wang
Jingfeng Zhang
Mohan Kankanhalli
AAML
SILM
33
68
0
20 Oct 2023
Pseudointelligence: A Unifying Framework for Language Model Evaluation
Shikhar Murty
Orr Paradise
Pratyusha Sharma
15
0
0
18 Oct 2023
A State-Vector Framework for Dataset Effects
E. Sahak
Zining Zhu
Frank Rudzicz
30
1
0
17 Oct 2023
Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs
Chenyang Yang
Rishabh Rustogi
Rachel A. Brower-Sinning
Grace A. Lewis
Christian Kästner
Tongshuang Wu
KELM
38
12
0
14 Oct 2023
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters
Yixin Wan
George Pu
Jiao Sun
Aparna Garimella
Kai-Wei Chang
Nanyun Peng
34
162
0
13 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Alexander Robey
Yan Sun
Hamed Hassani
George J. Pappas
Eric Wong
AAML
61
582
0
12 Oct 2023
NEWTON: Are Large Language Models Capable of Physical Reasoning?
Yi Ru Wang
Jiafei Duan
Dieter Fox
S. Srinivasa
ELM
LRM
AIMat
ReLM
66
23
0
10 Oct 2023
Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
Robert Litschko
Max Müller-Eberstein
Rob van der Goot
Leon Weber
Barbara Plank
LRM
21
2
0
09 Oct 2023
Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences
Fred Hohman
Mary Beth Kery
Donghao Ren
Dominik Moritz
32
16
0
06 Oct 2023
Towards Robust and Generalizable Training: An Empirical Study of Noisy Slot Filling for Input Perturbations
Jiachi Liu
Liwen Wang
Guanting Dong
Xiaoshuai Song
Zechen Wang
...
Shanglin Lei
Jinzheng Zhao
Keqing He
Bo Xiao
Weiran Xu
35
6
0
05 Oct 2023
Observatory: Characterizing Embeddings of Relational Tables
Tianji Cong
Madelon Hulsebos
Zhenjie Sun
Paul Groth
H. V. Jagadish
31
8
0
05 Oct 2023
Co-audit: tools to help humans double-check AI-generated content
Andrew D. Gordon
Carina Negreanu
J. Cambronero
Rasika Chakravarthy
Ian Drosos
...
Hannah Richardson
Advait Sarkar
Stephanie Simmons
Jack Williams
Ben Zorn
39
13
0
02 Oct 2023
No Offense Taken: Eliciting Offensiveness from Language Models
Anugya Srivastava
Rahul Ahuja
Rohith Mukku
14
3
0
02 Oct 2023
Meta Semantic Template for Evaluation of Large Language Models
Yachuan Liu
Liang Chen
Jindong Wang
Qiaozhu Mei
Xing Xie
22
0
0
01 Oct 2023
Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals
Y. Gat
Nitay Calderon
Amir Feder
Alexander Chapanin
Amit Sharma
Roi Reichart
38
29
0
01 Oct 2023
A Brief History of Prompt: Leveraging Language Models. (Through Advanced Prompting)
G. Muktadir
SILM
34
8
0
30 Sep 2023
DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks
A. Maritan
Jiaao Chen
S. Dey
Luca Schenato
Diyi Yang
Xing Xie
ELM
LRM
27
42
0
29 Sep 2023
Language Models as a Service: Overview of a New Paradigm and its Challenges
Emanuele La Malfa
Aleksandar Petrov
Simon Frieder
Christoph Weinhuber
Ryan Burnell
Raza Nazar
Anthony Cohn
Nigel Shadbolt
Michael Wooldridge
ALM
ELM
35
3
0
28 Sep 2023
The Trickle-down Impact of Reward (In-)consistency on RLHF
Lingfeng Shen
Sihao Chen
Linfeng Song
Lifeng Jin
Baolin Peng
Haitao Mi
Daniel Khashabi
Dong Yu
34
21
0
28 Sep 2023
Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness
Valentin Barriere
Felipe del Rio
Andres Carvallo De Ferari
Carlos Aspillaga
Eugenio Herrera-Berg
Cristian Buc Calderon
DiffM
27
0
0
27 Sep 2023
EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
Tae Soo Kim
Yoonjoo Lee
Jamin Shin
Young-Ho Kim
Juho Kim
34
69
0
24 Sep 2023
On the Relationship between Skill Neurons and Robustness in Prompt Tuning
Leon Ackermann
Xenia Ohmer
AAML
29
0
0
21 Sep 2023
Inferring Capabilities from Task Performance with Bayesian Triangulation
John Burden
Konstantinos Voudouris
Ryan Burnell
Danaja Rutar
Lucy G. Cheke
José Hernández-Orallo
25
7
0
21 Sep 2023
ContextRef: Evaluating Referenceless Metrics For Image Description Generation
Elisa Kreiss
E. Zelikman
Christopher Potts
Nick Haber
29
5
0
21 Sep 2023
CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration
Rachneet Sachdeva
Martin Tutek
Iryna Gurevych
OODD
32
10
0
14 Sep 2023
Automating Behavioral Testing in Machine Translation
Javier Ferrando
Matthias Sperber
Hendra Setiawan
Dominic Telaar
Saša Hasan
30
2
0
05 Sep 2023
UniSA: Unified Generative Framework for Sentiment Analysis
Zaijing Li
Ting-En Lin
Yuchuan Wu
Meng Liu
Fengxiao Tang
Mingde Zhao
Yongbin Li
32
16
0
04 Sep 2023
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
Charles O'Neill
Jack Miller
I. Ciucă
Y. Ting 丁
Thang Bui
31
3
0
26 Aug 2023
Construction Grammar and Language Models
Harish Tayyar Madabushi
Laurence Romain
P. Milin
Dagmar Divjak
29
5
0
25 Aug 2023
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
Yi Yao
Peng Liu
Tiancheng Zhao
Qianqian Zhang
Jiajia Liao
Chunxin Fang
Kyusong Lee
Qing Wang
VLM
ObjD
29
12
0
25 Aug 2023
Simple is Better and Large is Not Enough: Towards Ensembling of Foundational Language Models
Nancy Tyagi
Aidin Shiri
Surjodeep Sarkar
A. Umrawal
Manas Gaur
27
1
0
23 Aug 2023
LEAP: Efficient and Automated Test Method for NLP Software
Ming-Ming Xiao
Yan Xiao
Hai Dong
Shunhui Ji
Pengcheng Zhang
AAML
22
8
0
22 Aug 2023
An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software
Wenxuan Wang
Jingyuan Huang
Jen-tse Huang
Chang Chen
Jiazhen Gu
Pinjia He
Michael R. Lyu
VLM
36
6
0
18 Aug 2023