1806.00692
Stress Test Evaluation for Natural Language Inference
2 June 2018
Aakanksha Naik
Abhilasha Ravichander
Norman M. Sadeh
Carolyn Rose
Graham Neubig
ELM
Papers citing
"Stress Test Evaluation for Natural Language Inference"
50 / 149 papers shown
Title
Generalized Quantifiers as a Source of Error in Multilingual NLU Benchmarks
Ruixiang Cui
Daniel Hershcovich
Anders Søgaard
87
13
0
22 Apr 2022
When Does Syntax Mediate Neural Language Model Performance? Evidence from Dropout Probes
Mycal Tucker
Tiwalayo Eisape
Peng Qian
R. Levy
J. Shah
MILM
66
12
0
20 Apr 2022
mGPT: Few-Shot Learners Go Multilingual
Oleh Shliazhko
Alena Fenogenova
Maria Tikhonova
Vladislav Mikhailov
Anastasia Kozlova
Tatiana Shavrina
137
155
0
15 Apr 2022
Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets
Yuxiang Wu
Matt Gardner
Pontus Stenetorp
Pradeep Dasigi
95
68
0
24 Mar 2022
An Analysis of Negation in Natural Language Understanding Corpora
Md Mosharaf Hossain
Dhivya Chinnappa
Eduardo Blanco
116
43
0
16 Mar 2022
Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness
Tejas Gokhale
Swaroop Mishra
Man Luo
Bhavdeep Singh Sachdeva
Chitta Baral
102
30
0
15 Mar 2022
Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings
Neeraj Varshney
Swaroop Mishra
Chitta Baral
105
56
0
01 Mar 2022
Predicting Out-of-Distribution Error with the Projection Norm
Yaodong Yu
Zitong Yang
Alexander Wei
Yi-An Ma
Jacob Steinhardt
OODD
81
44
0
11 Feb 2022
Describing Differences between Text Distributions with Natural Language
Ruiqi Zhong
Charles Burton Snell
Dan Klein
Jacob Steinhardt
VLM
198
44
0
28 Jan 2022
Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions
Marwan Omar
Soohyeon Choi
Daehun Nyang
David A. Mohaisen
80
58
0
03 Jan 2022
Measure and Improve Robustness in NLP Models: A Survey
Xuezhi Wang
Haohan Wang
Diyi Yang
300
139
0
15 Dec 2021
Quantifying Adaptability in Pre-trained Language Models with 500 Tasks
Belinda Z. Li
Jane A. Yu
Madian Khabsa
Luke Zettlemoyer
A. Halevy
Jacob Andreas
ELM
89
17
0
06 Dec 2021
NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation
David Alfonso-Hermelo
Ahmad Rashid
Abbas Ghaddar
Huawei Noah’s
Mehdi Rezagholizadeh
77
2
0
09 Nov 2021
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
Wei Ping
Chejian Xu
Shuohang Wang
Zhe Gan
Yu Cheng
Jianfeng Gao
Ahmed Hassan Awadallah
Yangqiu Song
VLM
ELM
AAML
78
227
0
04 Nov 2021
IndoNLI: A Natural Language Inference Dataset for Indonesian
Rahmad Mahendra
Alham Fikri Aji
Samuel Louvan
Fahrurrozi Rahman
Clara Vania
70
32
0
27 Oct 2021
Behavioral Experiments for Understanding Catastrophic Forgetting
Samuel J. Bell
Neil D. Lawrence
82
4
0
20 Oct 2021
Retrieval-guided Counterfactual Generation for QA
Bhargavi Paranjape
Matthew Lamm
Ian Tenney
94
31
0
14 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
126
17
0
14 Oct 2021
ReaSCAN: Compositional Reasoning in Language Grounding
Zhengxuan Wu
Elisa Kreiss
Desmond C. Ong
Christopher Potts
CoGe
LRM
79
22
0
18 Sep 2021
Does External Knowledge Help Explainable Natural Language Inference? Automatic Evaluation vs. Human Ratings
Hendrik Schuff
Hsiu-yu Yang
Heike Adel
Ngoc Thang Vu
ELM
ReLM
LRM
62
13
0
16 Sep 2021
Types of Out-of-Distribution Texts and How to Detect Them
Udit Arora
William Huang
He He
OODD
281
101
0
14 Sep 2021
An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model
Kijong Han
Seojin Lee
Wooin Lee
Joosung Lee
Donghun Lee
AAML
38
5
0
10 Sep 2021
Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning
Prasetya Ajie Utama
N. Moosavi
Victor Sanh
Iryna Gurevych
AAML
128
36
0
09 Sep 2021
Unsupervised Pre-training with Structured Knowledge for Improving Natural Language Inference
Xiaoyu Yang
Xiao-Dan Zhu
Zhan Shi
Tianda Li
SSL
54
1
0
08 Sep 2021
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
Amir Feder
Katherine A. Keith
Emaad A. Manzoor
Reid Pryzant
Dhanya Sridhar
...
Roi Reichart
Margaret E. Roberts
Brandon M Stewart
Victor Veitch
Diyi Yang
CML
118
246
0
02 Sep 2021
Grounding Representation Similarity with Statistical Testing
Frances Ding
Jean-Stanislas Denain
Jacob Steinhardt
87
30
0
03 Aug 2021
Stress Test Evaluation of Biomedical Word Embeddings
Vladimir Araujo
Andrés Carvallo
Carlos Aspillaga
C. Thorne
Denis Parra
44
8
0
24 Jul 2021
Tailor: Generating and Perturbing Text with Semantic Controls
Alexis Ross
Tongshuang Wu
Hao Peng
Matthew E. Peters
Matt Gardner
202
79
0
15 Jul 2021
An Investigation of the (In)effectiveness of Counterfactually Augmented Data
Nitish Joshi
He He
OODD
86
47
0
01 Jul 2021
Combining Feature and Instance Attribution to Detect Artifacts
Pouya Pezeshkpour
Sarthak Jain
Sameer Singh
Byron C. Wallace
TDI
127
42
0
01 Jul 2021
The MultiBERTs: BERT Reproductions for Robustness Analysis
Thibault Sellam
Steve Yadlowsky
Jason W. Wei
Naomi Saphra
Alexander D'Amour
...
Iulia Turc
Jacob Eisenstein
Dipanjan Das
Ian Tenney
Ellie Pavlick
111
95
0
30 Jun 2021
Probing Pre-Trained Language Models for Disease Knowledge
Israa Alghanmi
Luis Espinosa-Anke
Steven Schockaert
LM&MA
ELM
82
13
0
14 Jun 2021
Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP
Anthony Chen
Pallavi Gudipati
Shayne Longpre
Xiao Ling
Sameer Singh
73
40
0
12 Jun 2021
Figurative Language in Recognizing Textual Entailment
Tuhin Chakrabarty
Debanjan Ghosh
Adam Poliak
Smaranda Muresan
64
38
0
02 Jun 2021
SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics
Hitomi Yanaka
K. Mineshima
Kentaro Inui
NAI
AI4CE
117
11
0
02 Jun 2021
Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests
Victor Veitch
Alexander D'Amour
Steve Yadlowsky
Jacob Eisenstein
OOD
82
93
0
31 May 2021
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
Zhiyi Ma
Kawin Ethayarajh
Tristan Thrush
Somya Jain
Ledell Yu Wu
Robin Jia
Christopher Potts
Adina Williams
Douwe Kiela
ELM
115
59
0
21 May 2021
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
Ruiqi Zhong
Dhruba Ghosh
Dan Klein
Jacob Steinhardt
91
36
0
13 May 2021
Understanding by Understanding Not: Modeling Negation in Language Models
Arian Hosseini
Siva Reddy
Dzmitry Bahdanau
R. Devon Hjelm
Alessandro Sordoni
Rameswar Panda
98
90
0
07 May 2021
Flexible Generation of Natural Language Deductions
Kaj Bostrom
Xinyu Zhao
Swarat Chaudhuri
Greg Durrett
ReLM
LRM
317
33
0
18 Apr 2021
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela
Max Bartolo
Yixin Nie
Divyansh Kaushik
Atticus Geiger
...
Pontus Stenetorp
Robin Jia
Joey Tianyi Zhou
Christopher Potts
Adina Williams
218
411
0
07 Apr 2021
What Will it Take to Fix Benchmarking in Natural Language Understanding?
Samuel R. Bowman
George E. Dahl
ELM
ALM
78
164
0
05 Apr 2021
Contrastive Explanations for Model Interpretability
Alon Jacovi
Swabha Swayamdipta
Shauli Ravfogel
Yanai Elazar
Yejin Choi
Yoav Goldberg
163
98
0
02 Mar 2021
NoiseQA: Challenge Set Evaluation for User-Centric Question Answering
Abhilasha Ravichander
Siddharth Dalmia
Maria Ryskina
Florian Metze
Eduard H. Hovy
A. Black
ELM
59
32
0
16 Feb 2021
Statistically Profiling Biases in Natural Language Reasoning Datasets and Models
Shanshan Huang
Kenny Q. Zhu
34
1
0
09 Feb 2021
SICKNL: A Dataset for Dutch Natural Language Inference
G. Wijnholds
M. Moortgat
119
26
0
14 Jan 2021
Robustness Gym: Unifying the NLP Evaluation Landscape
Karan Goel
Nazneen Rajani
Jesse Vig
Samson Tan
Jason M. Wu
Stephan Zheng
Caiming Xiong
Joey Tianyi Zhou
Christopher Ré
AAML
OffRL
OOD
199
140
0
13 Jan 2021
Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models
Tongshuang Wu
Marco Tulio Ribeiro
Jeffrey Heer
Daniel S. Weld
142
251
0
01 Jan 2021
HateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger
B. Vidgen
Dong Nguyen
Zeerak Talat
Helen Z. Margetts
J. Pierrehumbert
135
276
0
31 Dec 2020
DynaSent: A Dynamic Benchmark for Sentiment Analysis
Christopher Potts
Zhengxuan Wu
Atticus Geiger
Douwe Kiela
299
80
0
30 Dec 2020