We consider how to measure the robustness of a neural network against adversarial examples. We introduce three new attack algorithms, each tailored to a different distance metric, for finding adversarial examples: given an image x and a target class, we find a new image x' that is similar to x but classified as the target. We show that our attacks are significantly more powerful than previously published attacks: in particular, they find adversarial examples that are between 2 and 10 times closer. We then study defensive distillation, a recently proposed approach for increasing the robustness of neural networks. Against defensively distilled networks, our attacks succeed with 200× higher probability than previous attacks, effectively breaking defensive distillation and showing that it provides little added security. We hope our attacks will be used as a benchmark in future attempts to build neural networks that resist adversarial examples.
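The general formulation behind such attacks, minimizing a distance metric (here L2) plus a misclassification loss via gradient descent, can be sketched on a toy problem. This is a heavily simplified illustration under stated assumptions, not the paper's actual algorithms: the linear two-class "network", the constants `c`, `lr`, and `kappa`, and the hinge-style loss are all illustrative choices.

```python
import numpy as np

# Hedged sketch: a targeted L2-style attack on a TOY linear classifier.
# Minimize ||delta||^2 + c * max(z_other - z_target + kappa, 0) by
# gradient descent, keeping the smallest successful perturbation seen.
# All model details and constants here are assumptions for illustration.

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))       # toy 2-class linear model on 4-d inputs
b = np.zeros(2)

def logits(x):
    return W @ x + b

def attack(x, target, c=5.0, lr=0.01, steps=500, kappa=0.5):
    """Find a small delta so that x + delta is classified as `target`."""
    other = 1 - target            # the only competing class in this toy setup
    delta = np.zeros_like(x)
    best = None
    for _ in range(steps):
        z = logits(x + delta)
        if z[target] > z[other] and (
            best is None or np.linalg.norm(delta) < np.linalg.norm(best)
        ):
            best = delta.copy()   # record smallest adversarial perturbation
        # gradient of the objective wrt delta (model is linear in the input)
        if z[other] - z[target] + kappa > 0:
            grad = 2 * delta + c * (W[other] - W[target])
        else:
            grad = 2 * delta      # loss inactive: only shrink the perturbation
        delta = delta - lr * grad
    return best if best is not None else delta

x = rng.normal(size=4)
target = 1 - int(np.argmax(logits(x)))   # aim for the other class
delta = attack(x, target)
adv = x + delta
```

The key design point, shared with the attacks the abstract describes, is treating adversarial-example generation as an optimization problem: the distance term keeps x' close to x while the loss term drives the classification toward the target class.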