Adversarial samples are maliciously crafted inputs that cause a machine learning classifier to produce incorrect output labels. An adversarial sample is typically generated by adding adversarial noise (AN) to a normal test sample. Recent literature has tried to analyze and harden learning-based classifiers against such AN. However, previous studies are mostly empirical and provide little understanding of why many learning-based classifiers, including deep neural networks (DNNs), are vulnerable to AN. To fill this gap, we propose a theoretical framework, built on the notion of topological spaces, to uncover such reasons. The central idea of our work is that for a given classification task, the robustness of a classifier f against AN is determined jointly by f itself and its oracle (such as human annotators of that specific task). This motivates a formal definition of "strong-robustness" that describes when a classifier is always robust against AN according to its oracle. The second key piece of our framework is the decomposition of f into f = f2 ∘ f1, in which f1 includes feature learning operations and f2 includes relatively simple decision functions for the classification. We theoretically prove that f is strong-robust against AN if and only if a special topological relationship exists between the two feature spaces defined by f1 and by the oracle. Theorems of our framework provide two important insights: (1) the strong-robustness of f is fully determined by its f1, not f2; (2) extra irrelevant features ruin the strong-robustness of f. Empirically, we find that the Siamese architecture helps DNN models approach the desired topological relationship for strong-robustness, which in turn effectively improves their robustness against AN.
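The abstract does not spell out the Siamese training procedure; as a rough illustration only, the PyTorch sketch below shows one common way such an architecture is set up: a shared feature extractor (mirroring the abstract's f1), a simple classification head (mirroring f2), and a contrastive pairwise term that pulls same-label inputs together in feature space. The network sizes, margin, and loss weighting are illustrative assumptions, not values from the paper.

```python
# Minimal Siamese-style sketch (illustrative assumptions, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseClassifier(nn.Module):
    def __init__(self, in_dim=784, feat_dim=64, n_classes=10):
        super().__init__()
        # f1: feature-learning part, shared between both branches of the Siamese pair
        self.f1 = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        # f2: relatively simple decision function on top of the learned features
        self.f2 = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        z = self.f1(x)          # feature representation
        return self.f2(z), z    # class logits and features

def siamese_loss(model, x_a, x_b, y_a, y_b, margin=1.0, lam=0.5):
    """Cross-entropy on both inputs plus a contrastive term on their features."""
    logits_a, z_a = model(x_a)
    logits_b, z_b = model(x_b)
    ce = F.cross_entropy(logits_a, y_a) + F.cross_entropy(logits_b, y_b)
    dist = F.pairwise_distance(z_a, z_b)
    same = (y_a == y_b).float()
    # Pull same-class pairs together; push different-class pairs at least `margin` apart.
    contrastive = (same * dist.pow(2) +
                   (1 - same) * F.relu(margin - dist).pow(2)).mean()
    return ce + lam * contrastive

# Usage with random data, just to show the shape of one training step.
model = SiameseClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_a, x_b = torch.randn(32, 784), torch.randn(32, 784)
y_a, y_b = torch.randint(0, 10, (32,)), torch.randint(0, 10, (32,))
loss = siamese_loss(model, x_a, x_b, y_a, y_b)
opt.zero_grad()
loss.backward()
opt.step()
```

The contrastive term is what encourages inputs the oracle treats as equivalent to map to nearby points in the f1 feature space, which is the intuition behind using a Siamese setup to approach the topological relationship described above.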