Beyond Accuracy: Behavioral Testing of NLP models with CheckList

8 May 2020
Marco Tulio Ribeiro
Tongshuang Wu
Carlos Guestrin
Sameer Singh
    ELM

Papers citing "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"

50 / 664 papers shown
Disentangling Hate Across Target Identities
Yiping Jin
Leo Wanner
Aneesh Moideen Koya
25
0
0
14 Oct 2024
Uncovering Factor Level Preferences to Improve Human-Model Alignment
Juhyun Oh
Eunsu Kim
Jiseon Kim
Wenda Xu
Inha Cha
William Yang Wang
Alice H. Oh
34
0
0
09 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
87
1
0
09 Oct 2024
Mechanistic?
Naomi Saphra
Sarah Wiegreffe
AI4CE
29
9
0
07 Oct 2024
Cognitive Biases in Large Language Models for News Recommendation
Yougang Lyu
Xiaoyu Zhang
Zhaochun Ren
Maarten de Rijke
31
2
0
03 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai
Sarvnaz Karimi
Biaoyan Fang
36
0
0
29 Sep 2024
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement
Ishani Mondal
Zongxia Li
Yufang Hou
Anandhavelu Natarajan
Aparna Garimella
Jordan Boyd-Graber
36
3
0
28 Sep 2024
Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations
Supriya Manna
Niladri Sett
AAML
29
2
0
26 Sep 2024
Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification
Guanyi Mou
Yichuan Li
Kyumin Lee
36
3
0
26 Sep 2024
An Effective, Robust and Fairness-aware Hate Speech Detection Framework
Guanyi Mou
Kyumin Lee
29
2
0
25 Sep 2024
What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing
Chenyang Yang
Yining Hong
Grace A. Lewis
Tongshuang Wu
Christian Kastner
38
1
0
14 Sep 2024
Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue
Jonathan Ivey
Shivani Kumar
Jiayu Liu
Hua Shen
Sushrita Rakshit
...
Dustin Wright
Abraham Israeli
Anders Giovanni Møller
Lechen Zhang
David Jurgens
47
3
0
12 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM
AAML
57
1
0
05 Sep 2024
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang
Fuyang Cui
Keiran Paster
Jimmy Ba
Pashootan Vaezipoor
Silviu Pitis
Michael Ruogu Zhang
28
1
0
01 Sep 2024
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao
Abdullatif Köksal
Yihong Liu
Leonie Weissweiler
Anna Korhonen
Hinrich Schütze
SyDa
38
1
0
30 Aug 2024
Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification
Abu Adnan Sadi
Mohammad Ashrafuzzaman Khan
Lubaba Binte Saber
43
0
0
28 Aug 2024
SCENE: Evaluating Explainable AI Techniques Using Soft Counterfactuals
Haoran Zheng
Utku Pamuksuz
29
0
0
08 Aug 2024
Adversarial Text Rewriting for Text-aware Recommender Systems
Ganesh Ghalme
Reshef Meir
Srijan Kumar
42
0
0
01 Aug 2024
Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting
Ying Li
Rahul Singh
Tarun Joshi
Agus Sudjianto
30
0
0
31 Jul 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
Nitay Calderon
Roi Reichart
40
10
0
27 Jul 2024
Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon
Ari Holtzman
Peter West
William Y. Wang
Naomi Saphra
39
10
0
22 Jul 2024
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou
Shudong Liu
Maizhen Ning
Wei Liu
Jindong Wang
Derek F. Wong
Xiaowei Huang
Qiufeng Wang
Kaizhu Huang
ELM
LRM
68
23
0
11 Jul 2024
AutoBencher: Towards Declarative Benchmark Construction
Xiang Lisa Li
E. Liu
Percy Liang
Tatsunori Hashimoto
53
2
0
11 Jul 2024
A Survey on Natural Language Counterfactual Generation
Yongjie Wang
Xiaoqi Qiu
Yu Yue
Xu Guo
Zhiwei Zeng
Yuhong Feng
Zhiqi Shen
42
5
0
04 Jul 2024
Social Bias Evaluation for Large Language Models Requires Prompt Variations
Rem Hida
Masahiro Kaneko
Naoaki Okazaki
38
14
0
03 Jul 2024
Evaluating the Robustness of Adverse Drug Event Classification Models Using Templates
Dorothea MacPhail
David Harbecke
Lisa Raithel
Sebastian Möller
25
1
0
02 Jul 2024
Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?
Nishant Balepur
Rachel Rudinger
50
6
0
02 Jul 2024
A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers
Valentin Barriere
Sebastian Cifuentes
28
0
0
01 Jul 2024
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
Shubhra Mishra
Gabriel Poesia
Belinda Mo
Noah D. Goodman
40
3
0
01 Jul 2024
Fuzzy Logic Guided Reward Function Variation: An Oracle for Testing Reinforcement Learning Programs
Shiyu Zhang
Haoyang Song
Qixin Wang
Yu Pei
42
0
0
28 Jun 2024
Changing Answer Order Can Decrease MMLU Accuracy
Vipul Gupta
David Pantoja
Candace Ross
Adina Williams
Megan Ung
64
22
0
27 Jun 2024
Automated Adversarial Discovery for Safety Classifiers
Yash Kumar Lal
Preethi Lahoti
Aradhana Sinha
Yao Qin
Ananth Balashankar
55
0
0
24 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
42
11
0
19 Jun 2024
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models
Yuqing Wang
Yun Zhao
LRM
AAML
ELM
27
1
0
16 Jun 2024
Enhancing Question Answering on Charts Through Effective Pre-training Tasks
Ashim Gupta
Vivek Gupta
Shuo Zhang
Yujie He
Ning Zhang
Shalin S Shah
33
2
0
14 Jun 2024
Using Quality Attribute Scenarios for ML Model Test Case Generation
Rachel A. Brower-Sinning
Grace A. Lewis
Sebastián Echeverría
Ipek Ozkaya
37
0
0
12 Jun 2024
Adversarial Evasion Attack Efficiency against Large Language Models
João Vitorino
Eva Maia
Isabel Praça
AAML
43
2
0
12 Jun 2024
Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
Junlin Wang
Tianyi Yang
Roy Xie
Bhuwan Dhingra
SILM
AAML
36
4
0
10 Jun 2024
Are LLMs classical or nonmonotonic reasoners? Lessons from generics
Alina Leidinger
R. Rooij
Ekaterina Shutova
LRM
28
3
0
05 Jun 2024
Probing the Category of Verbal Aspect in Transformer Language Models
Anisia Katinskaia
R. Yangarber
58
2
0
04 Jun 2024
Harnessing Business and Media Insights with Large Language Models
Yujia Bao
Ankit Parag Shah
Neeru Narang
Jonathan Rivers
Rajeev Maksey
...
Gyuhak Kim
Dengpan Yin
Don Hejna
Mo Nomeli
Wei Wei
AIFin
46
2
0
02 Jun 2024
WebSuite: Systematically Evaluating Why Web Agents Fail
Eric Li
Jim Waldo
LLMAG
28
4
0
01 Jun 2024
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Jiatong Li
Renjun Hu
Kunzhe Huang
Zhuang Yan
Qi Liu
Mengxiao Zhu
Xing Shi
Wei Lin
KELM
51
5
0
30 May 2024
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models
Aparna Elangovan
Ling Liu
Lei Xu
S. Bodapati
Dan Roth
ELM
30
9
0
28 May 2024
Natural Language Processing RELIES on Linguistics
Juri Opitz
Shira Wein
Nathan Schneider
AI4CE
55
7
0
09 May 2024
Mitigating Exaggerated Safety in Large Language Models
Ruchi Bhalani
Ruchira Ray
37
1
0
08 May 2024
Zero-shot LLM-guided Counterfactual Generation for Text
Amrita Bhattacharjee
Raha Moraffah
Joshua Garland
Huan Liu
46
4
0
08 May 2024
Are Models Biased on Text without Gender-related Language?
Catarina G Belém
P. Seshadri
Yasaman Razeghi
Sameer Singh
38
8
0
01 May 2024
Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking
Hong Jin Kang
Fabrice Harel-Canada
Muhammad Ali Gulzar
Violet Peng
Miryung Kim
44
2
0
29 Apr 2024
Empowering Large Language Models for Textual Data Augmentation
Yichuan Li
Kaize Ding
Jianling Wang
Kyumin Lee
29
10
0
26 Apr 2024