Understanding Frank-Wolfe Adversarial Training

Abstract

Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique that approximately solves a robust optimization problem to minimize the worst-case loss and is widely regarded as the most effective defense against such attacks. While projected gradient descent (PGD) has received the most attention for approximately solving the inner maximization of AT, Frank-Wolfe (FW) optimization is projection-free and can be adapted to any \ell_p norm. A Frank-Wolfe adversarial training approach is presented and is shown to provide robustness competitive with PGD-AT for a variety of architectures, attacks, and datasets. Exploiting a representation of the FW attack, we derive the following geometric insight: the larger the \ell_2 norm of an \ell_\infty attack, the less the loss gradient varies. It is then experimentally demonstrated that \ell_\infty attacks against robust models achieve nearly the maximal possible \ell_2 distortion, providing a new lens into the specific type of regularization that AT bestows. Using FW optimization in conjunction with robust models, we are able to generate sparse, human-interpretable counterfactual explanations without relying on expensive \ell_1 projections.
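
The abstract does not spell out the attack itself, but the projection-free idea it refers to is standard Frank-Wolfe: at each step, a linear maximization oracle picks a vertex of the feasible set, and the iterate moves to a convex combination of itself and that vertex, so it never leaves the set and no projection is needed. Below is a minimal sketch of such an \ell_\infty attack in PyTorch, assuming a classifier `model`, inputs `x` in [0, 1], labels `y`, and the classic 2/(t+2) step size; the function name and defaults are hypothetical, and the paper's exact FW variant may differ (e.g., it may use momentum).

```python
import torch
import torch.nn.functional as F

def fw_linf_attack(model, x, y, eps=8/255, steps=20):
    """Sketch of a projection-free Frank-Wolfe attack over the l_inf ball
    of radius eps around x, intersected with the valid pixel box [0, 1]."""
    lo = (x - eps).clamp(min=0.0)  # lower corner of the feasible box
    hi = (x + eps).clamp(max=1.0)  # upper corner of the feasible box
    x_adv = x.clone().detach()
    for t in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Linear maximization oracle: the box vertex most aligned with the
        # loss gradient maximizes the linearized objective.
        v = torch.where(grad > 0, hi, lo)
        # Classic FW step size; a convex combination of feasible points is
        # itself feasible, so no projection is ever needed.
        gamma = 2.0 / (t + 2.0)
        x_adv = ((1.0 - gamma) * x_adv + gamma * v).detach()
    return x_adv
```

Note that the iterate is a weighted average of box vertices, which is what makes the FW attack amenable to the geometric analysis mentioned above: the \ell_2 norm of the resulting perturbation reflects how consistently the loss gradient points in the same direction across steps.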
