Decoupled Weight Decay for Any Norm

With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $L_p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $L_p$ norms for $p < 1$. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
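To make the setting concrete, below is a minimal sketch of what a decoupled weight-decay step for different norms can look like when applied after an adaptive optimizer update. The function name `decoupled_lp_weight_decay` and the specific updates shown (multiplicative AdamW-style shrinkage for $p = 2$, proximal soft-thresholding for $p = 1$) are illustrative assumptions for this sketch, not the paper's actual scheme, which handles arbitrary $p$, including $p < 1$.

```python
import torch


def decoupled_lp_weight_decay(params, lr, wd, p=1.0):
    """Apply a decoupled L_p weight-decay step after the optimizer update.

    Illustrative sketch only (not the paper's exact update rule):
      - p == 2: standard decoupled (AdamW-style) multiplicative shrinkage.
      - p == 1: proximal soft-thresholding, which produces exact zeros and
        avoids the gradient blow-up of a naive |w|^(p-1) penalty near w = 0.
    """
    with torch.no_grad():
        for w in params:
            if p == 2.0:
                # w <- w - lr * wd * w, applied outside the loss gradient
                w.mul_(1.0 - lr * wd)
            elif p == 1.0:
                # proximal map of lr * wd * |w|: shrink magnitudes toward zero
                w.copy_(torch.sign(w) * torch.clamp(w.abs() - lr * wd, min=0.0))
            else:
                raise NotImplementedError("general p requires the paper's scheme")


# Hypothetical usage inside a training loop:
#   loss.backward()
#   optimizer.step()                      # e.g. Adam, without built-in decay
#   decoupled_lp_weight_decay(model.parameters(), lr=1e-3, wd=1e-2, p=1.0)
```

The key design point illustrated here is decoupling: the decay is applied as a separate step rather than folded into the loss gradient, so it composes with adaptive optimizers and, for $p = 1$, drives weights exactly to zero instead of merely shrinking them.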