Stress Test Evaluation for Natural Language Inference

2 June 2018

Aakanksha Naik

Abhilasha Ravichander

Graham Neubig

Papers citing "Stress Test Evaluation for Natural Language Inference"

50 / 149 papers shown

Title
Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants Jaione Bengoetxea Itziar Gonzalez-Dios Rodrigo Agerri 19 0 0 18 Jun 2025
Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding Yeonkyoung So Gyuseong Lee Sungmok Jung Joonhak Lee JiA Kang Sangho Kim Jaejin Lee 38 0 0 17 Jun 2025
Exploring Explanations Improves the Robustness of In-Context Learning Ukyo Honda Tatsushi Oka LRM 70 0 0 03 Jun 2025
What Has Been Lost with Synthetic Evaluation? Alexander Gill Abhilasha Ravichander Ana Marasović ELM 36 0 0 28 May 2025
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification Leon Eshuijs Shihan Wang Antske Fokkens 144 0 0 09 May 2025
aiXamine: Simplified LLM Safety and Security Fatih Deniz Dorde Popovic Yazan Boshmaf Euisuh Jeong M. Ahmad Sanjay Chawla Issa M. Khalil ELM 341 0 0 21 Apr 2025
CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations Man Ho Lam Chaozheng Wang Jen-tse Huang Michael R. Lyu LRM 112 1 0 19 Apr 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs Zhaofeng Wu Michihiro Yasunaga Andrew Cohen Yoon Kim Asli Celikyilmaz Marjan Ghazvininejad 90 3 0 14 Mar 2025
Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation Yue Zhou Yi-Ju Chang Yuan Wu MoMe 122 3 0 21 Feb 2025
From Superficial Patterns to Semantic Understanding: Fine-Tuning Language Models on Contrast Sets Daniel Petrov 50 0 0 05 Jan 2025
Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task? Zhiqiang Pi Annapurna Vadaparty Benjamin Bergen Cameron R. Jones 81 3 0 20 Jun 2024
Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models Vishruth Veerendranath Vishwa Shah Kshitish Ghate 101 0 0 22 Apr 2024
Specification Overfitting in Artificial Intelligence Benjamin Roth Pedro Henrique Luz de Araujo Yuxi Xia Saskia Kaltenbrunner Christoph Korab 233 1 0 13 Mar 2024
Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models Erik Arakelyan Zhaoqi Liu Isabelle Augenstein AAML 145 12 0 25 Jan 2024
Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study Maike Zufle Verna Dankers Ivan Titov 96 0 0 16 Nov 2023
Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness Ashim Gupta Rishanth Rajendhran Nathan Stringham Vivek Srikumar Ana Marasović AAML 90 3 0 16 Nov 2023
Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features Ester Hlavnova Sebastian Ruder 84 5 0 11 Jul 2023
Evaluating Paraphrastic Robustness in Textual Entailment Models Dhruv Verma Yash Kumar Lal Shreyashee Sinha Benjamin Van Durme Adam Poliak 91 5 0 29 Jun 2023
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework Yangyi Chen Hongcheng Gao Ganqu Cui Lifan Yuan Dehan Kong ... Longtao Huang H. Xue Zhiyuan Liu Maosong Sun Heng Ji AAML ELM 101 6 0 29 May 2023
On Degrees of Freedom in Defining and Testing Natural Language Understanding Saku Sugawara S. Tsugita ELM 86 1 0 24 May 2023
Out-of-Distribution Generalization in Text Classification: Past, Present, and Future Linyi Yang Yangqiu Song Xuan Ren Chenyang Lyu Yidong Wang Lingqiao Liu Jindong Wang Jennifer Foster Yue Zhang OOD 129 3 0 23 May 2023
A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures Amanda Baughan Allison Mercurio Ariel Liu Xuezhi Wang Jilin Chen Xiao Ma 76 15 0 01 Mar 2023
SMoA: Sparse Mixture of Adapters to Mitigate Multiple Dataset Biases Yanchen Liu Jing Yang Yan Chen Jing Liu Huaqin Wu MoE 85 2 0 28 Feb 2023
On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex Terry Yue Zhuo Zhuang Li Yujin Huang Fatemeh Shiri Weiqing Wang Gholamreza Haffari Yuan-Fang Li AAML 107 57 0 30 Jan 2023
DISCO: Distilling Counterfactuals with Large Language Models Zeming Chen Qiyue Gao Antoine Bosselut Ashish Sabharwal Kyle Richardson 96 31 0 20 Dec 2022
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation Tianxing He Jingyu Zhang Tianle Wang Sachin Kumar Kyunghyun Cho James R. Glass Yulia Tsvetkov 150 45 0 20 Dec 2022
Feature-Level Debiased Natural Language Understanding Yougang Lyu Piji Li Yechang Yang Maarten de Rijke Fajie Yuan Yukun Zhao D. Yin Zhaochun Ren 91 12 0 11 Dec 2022
AGRO: Adversarial Discovery of Error-prone groups for Robust Optimization Bhargavi Paranjape Pradeep Dasigi Vivek Srikumar Luke Zettlemoyer Hannaneh Hajishirzi 98 8 0 02 Dec 2022
AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning Jiaxin Wen Yeshuang Zhu Jinchao Zhang Jie Zhou Minlie Huang CML AAML 115 9 0 29 Nov 2022
Using Focal Loss to Fight Shallow Heuristics: An Empirical Analysis of Modulated Cross-Entropy in Natural Language Inference Frano Rajic Ivan Stresec Axel Marmet Tim Postuvan 46 3 0 23 Nov 2022
Capabilities for Better ML Engineering Chenyang Yang Rachel A. Brower-Sinning Grace A. Lewis Christian Kastner Tongshuang Wu 63 4 0 11 Nov 2022
Looking at the Overlooked: An Analysis on the Word-Overlap Bias in Natural Language Inference S. Rajaee Yadollah Yaghoobzadeh Mohammad Taher Pilehvar 73 5 0 07 Nov 2022
Probing neural language models for understanding of words of estimative probability Damien Sileo Marie-Francine Moens 51 12 0 07 Nov 2022
Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic Mandar Sharma Nikhil Muralidhar Naren Ramakrishnan 58 6 0 03 Nov 2022
CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation Abhilasha Ravichander Matt Gardner Ana Marasović 112 35 0 01 Nov 2022
Lexical Generalization Improves with Larger Models and Longer Training Elron Bandel Yoav Goldberg Yanai Elazar 94 7 0 23 Oct 2022
Enhancing Tabular Reasoning with Pattern Exploiting Training Abhilash Shankarampeta Vivek Gupta Shuo Zhang LMTD RALM ReLM 139 6 0 21 Oct 2022
Measures of Information Reflect Memorization Patterns Rachit Bansal Danish Pruthi Yonatan Belinkov 110 10 0 17 Oct 2022
A Survey of Parameters Associated with the Quality of Benchmarks in NLP Swaroop Mishra Anjana Arunkumar Chris Bryan Chitta Baral 105 1 0 14 Oct 2022
Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding Songyang Gao Shihan Dou Qi Zhang Xuanjing Huang 43 8 0 14 Oct 2022
Benchmarking Long-tail Generalization with Likelihood Splits Ameya Godbole Robin Jia ALM 79 9 0 13 Oct 2022
CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation Tanay Dixit Bhargavi Paranjape Hannaneh Hajishirzi Luke Zettlemoyer SyDa 206 26 0 10 Oct 2022
InferES: A Natural Language Inference Corpus for Spanish Featuring Negation-Based Contrastive and Adversarial Examples Venelin Kovatchev Mariona Taulé 73 4 0 06 Oct 2022
Compositional Evaluation on Japanese Textual Entailment and Similarity Hitomi Yanaka K. Mineshima 93 24 0 09 Aug 2022
Measuring Causal Effects of Data Statistics on Language Model's 'Factual' Predictions Yanai Elazar Nora Kassner Shauli Ravfogel Amir Feder Abhilasha Ravichander Marius Mosbach Yonatan Belinkov Hinrich Schütze Yoav Goldberg CML SyDa MILM 110 55 0 28 Jul 2022
Probing via Prompting Jiaoda Li Ryan Cotterell Mrinmaya Sachan 109 13 0 04 Jul 2022
longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks Venelin Kovatchev Trina Chatterjee Venkata S Govindarajan Jifan Chen Eunsol Choi ... K. Erk Matthew Lease Junyi Jessy Li Yating Wu Kyle Mahowald AAML ELM 89 9 0 29 Jun 2022
LegoNN: Building Modular Encoder-Decoder Models Siddharth Dalmia Dmytro Okhonko M. Lewis Sergey Edunov Shinji Watanabe Florian Metze Luke Zettlemoyer Abdel-rahman Mohamed AuLLM MoE 71 14 0 07 Jun 2022
Linear Connectivity Reveals Generalization Strategies Jeevesh Juneja Rachit Bansal Kyunghyun Cho João Sedoc Naomi Saphra 333 48 0 24 May 2022
White-box Testing of NLP models with Mask Neuron Coverage Arshdeep Sekhon Yangfeng Ji Matthew B. Dwyer Yanjun Qi AAML 52 3 0 10 May 2022