Adversarial samples are maliciously created inputs that lead a machine learning classifier to produce incorrect output labels. An adversarial sample is often generated by adding adversarial noise (AN) to a normal test sample. Recent literature has tried to analyze and harden learning-based classifiers under such AN. However, most previous studies are empirical and provide little understanding of the underlying reasons why many machine learning classifiers, including deep neural networks (DNNs), are vulnerable to AN. To fill this gap, we propose a theoretical framework that uses two topological spaces to understand a classifier's robustness against AN. The central idea of our work is that, for a given classification task, the robustness of a classifier $f_1$ against AN is decided by both $f_1$ and its oracle $f_2$ (such as a human annotator of that specific task). This motivates us to formulate a formal definition of "strong-robustness" that describes when a classifier is always robust against AN according to its oracle $f_2$. The second key piece of our framework is the decomposition of $f_1$ into $c_1 \circ g_1$ (and of $f_2$ into $c_2 \circ g_2$), in which $g_1$ includes the feature learning operations and $c_1$ includes relatively simple decision functions for the classification. We theoretically prove that $f_1$ is strong-robust against AN if and only if a special topological relationship exists between the two feature spaces defined by $g_1$ and $g_2$. Surprisingly, our theorems indicate that the strong-robustness of $f_1$ against AN is fully determined by its $g_1$, not its $c_1$. Empirically, we find that the Siamese architecture can help DNN models approach topological equivalence between the two feature spaces, which in turn effectively improves their robustness against AN.
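To illustrate the Siamese observation at the end of the abstract, below is a minimal sketch in PyTorch; the architecture, loss, and all names (FeatureExtractor, contrastive_loss, the layer sizes) are our own assumed setup, not the paper's exact method. One shared feature extractor, playing the role of $g_1$, is trained with a contrastive loss on input pairs the oracle labels as same-class or different-class, nudging distances in the learned feature space toward agreement with the oracle's notion of closeness.

```python
# Sketch only: a Siamese arrangement with a contrastive loss. The shared
# network stands in for the abstract's g_1 (feature learning operations);
# the pair labels stand in for judgments by the oracle f_2.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureExtractor(nn.Module):
    """Maps raw inputs to a learned feature space (role of g_1)."""

    def __init__(self, in_dim=784, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x):
        return self.net(x)


def contrastive_loss(z1, z2, same_label, margin=1.0):
    """Contrastive loss on a pair of embeddings.

    same_label = 1 if the oracle assigns both inputs the same class, else 0.
    Same-class pairs are pulled together; different-class pairs are pushed
    apart until they are at least `margin` away.
    """
    dist = F.pairwise_distance(z1, z2)
    pos = same_label * dist.pow(2)
    neg = (1 - same_label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()


if __name__ == "__main__":
    g1 = FeatureExtractor()
    optimizer = torch.optim.Adam(g1.parameters(), lr=1e-3)

    # Toy random pairs standing in for (x_a, x_b, oracle same/different label).
    x_a, x_b = torch.randn(32, 784), torch.randn(32, 784)
    same = torch.randint(0, 2, (32,)).float()

    optimizer.zero_grad()
    loss = contrastive_loss(g1(x_a), g1(x_b), same)
    loss.backward()
    optimizer.step()
    print(f"contrastive loss: {loss.item():.4f}")
```

The intent of this kind of training, in the paper's terms, is to bring the feature space defined by the learned $g_1$ closer to the one defined by the oracle's $g_2$, which is the condition the abstract ties to strong-robustness.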