A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Samples
- AAML

Adversarial samples are maliciously crafted inputs that lead a learning-based classifier to produce incorrect output labels. An adversarial sample is often generated by adding an adversarial perturbation (AP) to a normal test sample. Recent studies that analyze classifiers under such AP are mostly empirical and provide little understanding of why the perturbations succeed. To fill this gap, we propose a theoretical framework for analyzing learning-based classifiers, especially deep neural networks (DNNs), in the face of such AP. Using concepts from topology, this framework brings forth the key reasons why an adversarial sample can fool a classifier (the predictor) and suggests a new focus on the classifier's oracle (for example, human annotators of that specific task). By investigating the topological relationship between the two (pseudo)metric spaces corresponding to the predictor and the oracle, we develop several ideal conditions that determine whether the predictor is always robust (strong-robust) against adversarial samples according to its oracle. The theoretical framework leads to a set of novel and complementary insights that have not been uncovered in the literature. Surprisingly, our theorems show that just one extra irrelevant feature can make a classifier not strong-robust, and that learning the right feature representation is the key to obtaining a classifier that is both accurate and strong-robust. Empirically, we find that a "Siamese architecture" can help DNN models approach the desired topological relationship for strong-robustness, which in turn effectively improves their performance against AP.
View on arXiv
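
The abstract's central objects are easier to picture with a little notation. The block below is an illustrative formalization, not the paper's exact theorem statements: the symbols f_1 (predictor), f_2 (oracle), d_1, d_2 (the (pseudo)metrics they induce on the input space), and the threshold \delta are introduced here only for this sketch.

```latex
% Illustrative sketch of a strong-robustness condition (notation is ours):
\[
  \forall\, x, x' \in X:\qquad d_2(x, x') < \delta \;\Longrightarrow\; f_1(x) = f_1(x').
\]
% Reading: any perturbation the oracle f_2 regards as negligible (small d_2)
% must not change the predictor f_1's label. This also suggests how a single
% irrelevant feature can break strong-robustness: if f_1 uses a feature that
% f_2 ignores, then x and x' can differ only in that feature, so d_2(x, x') = 0
% while f_1(x) \neq f_1(x') remains possible.
```

The closing claim about a "Siamese architecture" can likewise be pictured concretely. The following is a minimal, hypothetical training sketch assuming PyTorch; the Encoder, contrastive_loss, margin, and pair-sampling scheme are illustrative placeholders rather than the authors' configuration. The idea it illustrates: a contrastive term pulls the representations of oracle-equivalent pairs together (nudging the learned metric toward agreement with the oracle's), while a standard cross-entropy term preserves accuracy.

```python
# Hypothetical illustration: shaping a DNN's representation with a Siamese
# (contrastive) objective so that pairs the oracle treats as "the same"
# (e.g., a clean input and a lightly perturbed copy) stay close.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Small placeholder feature extractor; any DNN backbone could be used."""
    def __init__(self, in_dim=784, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull together pairs labeled same=1.0 by the oracle, push apart same=0.0."""
    dist = F.pairwise_distance(z1, z2)
    pos = same * dist.pow(2)
    neg = (1.0 - same) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

encoder = Encoder()
head = nn.Linear(64, 10)  # standard prediction head on top of the encoder
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3
)

def train_step(x1, x2, same, y):
    """x1: clean batch, x2: paired batch, same: float 0/1 tensor, y: labels."""
    z1, z2 = encoder(x1), encoder(x2)          # shared weights: the Siamese part
    loss = contrastive_loss(z1, z2, same)      # shape the learned metric
    loss = loss + F.cross_entropy(head(z1), y) # keep classification accuracy
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In such a setup the pairs (x1, x2, same) would typically be formed by matching each clean sample with a slightly perturbed copy (same = 1) and with an unrelated sample (same = 0); the design choice that makes the architecture "Siamese" is feeding both inputs through the same encoder weights.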