Decoupled Weight Decay for Any $p$ Norm

Abstract

With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on the Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $0 < p < 1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
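The abstract does not spell out the update rule, but the core idea of decoupling an $L_p$ penalty from the gradient-based step can be sketched as follows. The snippet below is an illustrative PyTorch sketch, not the paper's exact scheme: the function name, the subgradient-style shrinkage $\mathrm{sign}(w)\,|w|^{p-1}$, and the `eps` safeguard against blow-up near $w = 0$ for $p < 1$ are all assumptions; at $p = 2$ the update reduces to AdamW-style decoupled weight decay.

```python
import torch


def lp_weight_decay_(params, lr, weight_decay, p, eps=1e-8):
    """Decoupled L_p weight-decay step (illustrative sketch only).

    At p = 2 this reduces to AdamW-style decay, w <- w - lr * weight_decay * w.
    The eps term is an assumed guard against |w|^(p-1) diverging as w -> 0
    when p < 1; the paper derives its own divergence-free scheme.
    """
    with torch.no_grad():
        for w in params:
            shrink = torch.sign(w) * (w.abs() + eps) ** (p - 1)
            w.sub_(lr * weight_decay * shrink)


# Hypothetical usage: the L_p penalty never enters the loss, so the
# gradient-based (adaptive) step and the decay step stay decoupled.
model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()                                                        # gradient step
lp_weight_decay_(model.parameters(), lr=1e-3, weight_decay=1e-2, p=0.5)  # decay step
opt.zero_grad()
```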
