A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Samples
- AAML

Adversarial samples are maliciously crafted inputs that lead a learning-based classifier to produce incorrect output labels. An adversarial sample is often generated by adding an adversarial perturbation (AP) to a normal test sample. Recent studies that analyze classifiers under such AP are mostly empirical and provide little understanding of why the perturbations succeed. To fill this gap, we propose a theoretical framework for analyzing learning-based classifiers, especially deep neural networks (DNNs), in the face of such AP. Using concepts from topology, this framework brings forth the key reasons why an adversarial sample can fool a classifier (the predictor) and suggests a new focus on the classifier's oracle (for example, human annotators of that specific task). By investigating the topological relationship between the two (pseudo)metric spaces corresponding to the predictor and the oracle, we develop several ideal conditions that determine whether the predictor is always robust (strong-robust) against adversarial samples according to its oracle. The theoretical framework leads to a set of novel and complementary insights that have not been uncovered in the literature. Surprisingly, our theorems show that just one extra irrelevant feature can make a classifier not strong-robust, and that learning the right feature representation is the key to obtaining a classifier that is both accurate and strong-robust. Empirically, we find that a "Siamese architecture" can help DNN models approach the desired topological relationship for strong-robustness, which in turn effectively improves their performance against AP.
View on arXiv
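
The abstract's central objects are easier to picture with a little notation. The block below is an illustrative formalization, not the paper's exact theorem statements: the symbols f_1 (predictor), f_2 (oracle), d_1, d_2 (the (pseudo)metrics they induce on the input space), and the threshold \delta are introduced here only for this sketch.

```latex
% Illustrative sketch of a strong-robustness condition (notation is ours):
\[
  \forall\, x, x' \in X:\qquad d_2(x, x') < \delta \;\Longrightarrow\; f_1(x) = f_1(x').
\]
% Reading: any perturbation the oracle f_2 regards as negligible (small d_2)
% must not change the predictor f_1's label. This also suggests how a single
% irrelevant feature can break strong-robustness: if f_1 uses a feature that
% f_2 ignores, then x and x' can differ only in that feature, so d_2(x, x') = 0
% while f_1(x) \neq f_1(x') remains possible.
```

The closing claim about a "Siamese architecture" can likewise be pictured concretely. The following is a minimal, hypothetical training sketch assuming PyTorch; the Encoder, contrastive_loss, margin, and pair-sampling scheme are illustrative placeholders rather than the authors' configuration. The idea it illustrates: a contrastive term pulls the representations of oracle-equivalent pairs together (nudging the learned metric toward agreement with the oracle's), while a standard cross-entropy term preserves accuracy.

```python
# Hypothetical illustration: shaping a DNN's representation with a Siamese
# (contrastive) objective so that pairs the oracle treats as "the same"
# (e.g., a clean input and a lightly perturbed copy) stay close.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Small placeholder feature extractor; any DNN backbone could be used."""
    def __init__(self, in_dim=784, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull together pairs labeled same=1.0 by the oracle, push apart same=0.0."""
    dist = F.pairwise_distance(z1, z2)
    pos = same * dist.pow(2)
    neg = (1.0 - same) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

encoder = Encoder()
head = nn.Linear(64, 10)  # standard prediction head on top of the encoder
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3
)

def train_step(x1, x2, same, y):
    """x1: clean batch, x2: paired batch, same: float 0/1 tensor, y: labels."""
    z1, z2 = encoder(x1), encoder(x2)          # shared weights: the Siamese part
    loss = contrastive_loss(z1, z2, same)      # shape the learned metric
    loss = loss + F.cross_entropy(head(z1), y) # keep classification accuracy
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In such a setup the pairs (x1, x2, same) would typically be formed by matching each clean sample with a slightly perturbed copy (same = 1) and with an unrelated sample (same = 0); the design choice that makes the architecture "Siamese" is feeding both inputs through the same encoder weights.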