
Towards Improving Adversarial Training of NLP Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Abstract

Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models' performance, and its benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP, which we name Attacking to Training (A2T). The core part of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on the IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results show that it is possible to train empirically robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model's robustness to the attack it was originally trained with and can also defend the model against other types of attacks. Furthermore, we show that A2T can improve NLP models' standard accuracy, cross-domain generalization, and interpretability. Code is available at http://github.com/jinyongyoo/A2T.
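The abstract describes A2T only at a high level. As a rough illustration of what "vanilla adversarial training with a cheaper word substitution attack" means, here is a minimal, self-contained sketch. The toy synonym table, bag-of-words classifier, attack fraction of 0.2, and all function names are illustrative assumptions, not the authors' implementation; see the linked repository for the actual A2T attack and training code.

```python
# Minimal sketch of vanilla adversarial training with a cheap
# word-substitution attack, in the spirit of A2T. Everything here
# (synonym table, tiny classifier, attack fraction) is a toy
# assumption, NOT the paper's implementation.
import random
import numpy as np

# Toy synonym table standing in for the attack's substitution candidates.
SYNONYMS = {
    "good": ["great", "fine"], "bad": ["poor", "awful"],
    "movie": ["film", "picture"], "boring": ["dull", "tedious"],
}

VOCAB = sorted({w for k, vs in SYNONYMS.items() for w in [k] + vs}
               | {"was", "the", "a"})
IDX = {w: i for i, w in enumerate(VOCAB)}

def featurize(text):
    """Bag-of-words vector over the toy vocabulary."""
    x = np.zeros(len(VOCAB))
    for w in text.split():
        if w in IDX:
            x[IDX[w]] += 1.0
    return x

class TinyClassifier:
    """Logistic-regression stand-in for the paper's BERT/RoBERTa models."""
    def __init__(self):
        self.w = np.zeros(len(VOCAB))
        self.b = 0.0

    def predict_proba(self, text):
        p = 1.0 / (1.0 + np.exp(-(featurize(text) @ self.w + self.b)))
        return np.array([1.0 - p, p])  # [P(negative), P(positive)]

    def sgd_step(self, text, label, lr=0.1):
        grad = self.predict_proba(text)[1] - label  # dL/dz for logistic loss
        self.w -= lr * grad * featurize(text)
        self.b -= lr * grad

def word_substitution_attack(text, predict_proba, label, max_swaps=2):
    """Greedy word substitution: keep a swap only if it lowers the
    model's confidence in the true label."""
    words = text.split()
    swaps = 0
    for i, w in enumerate(words):
        if swaps >= max_swaps or w not in SYNONYMS:
            continue
        base = predict_proba(" ".join(words))[label]
        for cand in SYNONYMS[w]:
            trial = words[:i] + [cand] + words[i + 1:]
            if predict_proba(" ".join(trial))[label] < base:
                words[i] = cand
                swaps += 1
                break
    return " ".join(words)

# Vanilla adversarial training loop: perturb a fraction of the training
# examples on the fly each epoch and fit the model on the result.
data = [("the movie was good", 1), ("the movie was bad", 0),
        ("a boring movie", 0), ("a good movie", 1)]
model = TinyClassifier()
for epoch in range(20):
    random.shuffle(data)
    for text, label in data:
        if random.random() < 0.2:  # attack fraction is an assumption
            text = word_substitution_attack(text, model.predict_proba, label)
        model.sgd_step(text, label)
```

In the paper, both the victim models (BERT, RoBERTa) and the substitution attack are far more sophisticated, but the overall structure, generating perturbed training examples on the fly with a word substitution attack and training on them, is the idea the abstract refers to; A2T's contribution is making that inner attack cheap enough for routine training.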
