Using Honeypots to Catch Adversarial Attacks on Neural Networks
Deep neural networks are known to be vulnerable to adversarial attacks. Numerous efforts have focused on defenses that either try to patch "holes" in trained models or try to make it difficult or costly to compute adversarial examples that exploit them. In our work, we explore a counter-intuitive approach: a trapdoor-enabled defense. Unlike prior work that tries to patch or disguise vulnerable points in the model, we intentionally inject "trapdoors," user-induced weaknesses in the classification manifold that, like honeypots, attract attackers searching for adversarial examples. Attack algorithms naturally gravitate toward these trapdoors and produce adversarial examples that resemble the trapdoors in feature space, so the attacks can be identified by comparing their neuron activation signatures against those of the trapdoors. In this paper, we introduce trapdoors and describe an implementation of a trapdoor-enabled defense. First, we analyze trapdoors as a defense: we prove that trapdoors not only shape the computation of adversarial attacks, but also cause the resulting attack inputs to have feature representations very similar to those of trapdoored inputs. Second, we show experimentally that by protecting models with trapdoors, we can detect, with high accuracy, adversarial examples generated by state-of-the-art attacks (Projected Gradient Descent, the optimization-based Carlini-Wagner attack, Elastic Net, and BPDA), with negligible impact on the classification of normal inputs. These results generalize across classification domains, including image, facial, and traffic-sign recognition. Finally, we evaluate and validate the robustness of trapdoors against multiple strong adaptive attacks.
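To make the abstract's mechanism concrete, below is a minimal NumPy sketch of the two pieces it describes: embedding a trapdoor (a trigger pattern blended into training inputs, which are then relabeled to the trapdoor's target class) and detecting attacks by comparing an input's neuron activation signature against the trapdoor's. All names and parameters here (`embed_trapdoor`, `trapdoor_signature`, `feature_fn`, the blend ratio `alpha`, the similarity threshold) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def embed_trapdoor(x, pattern, mask, alpha=0.1):
    """Blend a fixed trigger pattern into input(s) x.

    Training the model on (trapdoored input, target label) pairs
    injects the trapdoor: the model learns a shortcut from the
    pattern to the target class, and attack optimizers searching
    for adversarial examples later converge onto that shortcut.
    """
    return (1 - alpha * mask) * x + alpha * mask * pattern

def trapdoor_signature(feature_fn, clean_inputs, pattern, mask, alpha=0.1):
    """Record the trapdoor's "neuron activation signature": the mean
    feature representation of trapdoored inputs, where feature_fn
    maps a batch of inputs to penultimate-layer activations."""
    trapdoored = embed_trapdoor(clean_inputs, pattern, mask, alpha)
    return feature_fn(trapdoored).mean(axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def looks_adversarial(x, feature_fn, signature, threshold):
    """Flag an input whose activation signature is suspiciously close
    to the trapdoor's; in practice the threshold would be calibrated
    on known-benign inputs to a chosen false-positive rate."""
    feats = feature_fn(x[None, ...])[0]
    return cosine_similarity(feats, signature) > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a real model's penultimate layer: a fixed random
    # linear map from flattened 8x8 "images" to 32-d features.
    W = rng.normal(size=(64, 32))
    feature_fn = lambda batch: batch.reshape(len(batch), -1) @ W

    pattern = rng.uniform(size=(8, 8))            # trigger pattern
    mask = np.zeros((8, 8)); mask[-3:, -3:] = 1   # bottom-right patch
    clean = rng.uniform(size=(100, 8, 8))

    sig = trapdoor_signature(feature_fn, clean, pattern, mask)
    trapped = embed_trapdoor(clean[0], pattern, mask)
    # With a real trained, trapdoored model, attack inputs score far
    # closer to the signature than benign ones; this random-feature
    # stand-in only demonstrates the plumbing, not the separation.
    print(cosine_similarity(feature_fn(trapped[None])[0], sig))
    print(cosine_similarity(feature_fn(clean[1][None])[0], sig))
```

In a real deployment, `feature_fn` would come from the protected model's own penultimate layer, and the detection threshold would be set from the similarity distribution of benign traffic.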