Adversarial Robustness May Be at Odds With Simplicity

Current techniques in machine learning are so far are unable to learn classifiers that are robust to adversarial perturbations. However, they are able to learn non-robust classifiers with very high accuracy, even in the presence of random perturbations. Towards explaining this gap, we highlight the hypothesis that In this note, we show that this hypothesis is indeed possible, by giving several theoretical examples of classification tasks and sets of "simple" classifiers for which: (1) There exists a simple classifier with high standard accuracy, and also high accuracy under random noise. (2) Any simple classifier is not robust: it must have high adversarial loss with perturbations. (3) Robust classification is possible, but only with more complex classifiers (exponentially more complex, in some examples). Moreover, This suggests an alternate explanation of this phenomenon, which appears in practice: the tradeoff may occur not because the classification task inherently requires such a tradeoff (as in [Tsipras-Santurkar-Engstrom-Turner-Madry `18]), but because the structure of our current classifiers imposes such a tradeoff.
View on arXiv