Understanding Frank-Wolfe Adversarial Training

Abstract

Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique that approximately solves a robust optimization problem to minimize the worst-case loss and is widely regarded as the most effective defense against such attacks. While projected gradient descent (PGD) has received the most attention for approximately solving the inner maximization of AT, Frank-Wolfe (FW) optimization is projection-free and can be adapted to any \ell_p norm. A Frank-Wolfe adversarial training approach is presented and is shown to provide robustness competitive with PGD-AT for a variety of architectures, attacks, and datasets. Exploiting a representation of the FW attack, we derive the following geometric insight: the larger the \ell_2 norm of an \ell_\infty attack, the less the loss gradient varies. It is then experimentally demonstrated that \ell_\infty attacks against robust models achieve nearly the maximal possible \ell_2 distortion, providing a new lens into the specific type of regularization that AT bestows. Using FW optimization in conjunction with robust models, we are able to generate sparse, human-interpretable counterfactual explanations without relying on expensive \ell_1 projections.
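
The abstract does not spell out the attack itself, but the projection-free idea it refers to is standard Frank-Wolfe: at each step, a linear maximization oracle picks a vertex of the feasible set, and the iterate moves to a convex combination of itself and that vertex, so it never leaves the set and no projection is needed. Below is a minimal sketch of such an \ell_\infty attack in PyTorch, assuming a classifier `model`, inputs `x` in [0, 1], labels `y`, and the classic 2/(t+2) step size; the function name and defaults are hypothetical, and the paper's exact FW variant may differ (e.g., it may use momentum).

```python
import torch
import torch.nn.functional as F

def fw_linf_attack(model, x, y, eps=8/255, steps=20):
    """Sketch of a projection-free Frank-Wolfe attack over the l_inf ball
    of radius eps around x, intersected with the valid pixel box [0, 1]."""
    lo = (x - eps).clamp(min=0.0)  # lower corner of the feasible box
    hi = (x + eps).clamp(max=1.0)  # upper corner of the feasible box
    x_adv = x.clone().detach()
    for t in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Linear maximization oracle: the box vertex most aligned with the
        # loss gradient maximizes the linearized objective.
        v = torch.where(grad > 0, hi, lo)
        # Classic FW step size; a convex combination of feasible points is
        # itself feasible, so no projection is ever needed.
        gamma = 2.0 / (t + 2.0)
        x_adv = ((1.0 - gamma) * x_adv + gamma * v).detach()
    return x_adv
```

Note that the iterate is a weighted average of box vertices, which is what makes the FW attack amenable to the geometric analysis mentioned above: the \ell_2 norm of the resulting perturbation reflects how consistently the loss gradient points in the same direction across steps.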
