Deep neural networks have relieved the feature engineering burden on human experts. However, comparable effort is now required to determine an effective architecture. In addition, as network sizes have grown excessively, considerable resources are also invested in reducing them. Sparsification of an over-complete model addresses these problems: it removes redundant parameters or connections either by pruning them away after training or by encouraging them to become zero during training. However, these approaches are not fully differentiable and interrupt end-to-end training with stochastic gradient descent, in that they require either a parameter-selection or a soft-thresholding step. In this study, we propose a fully differentiable sparsification method for deep neural networks, which allows parameters to become exactly zero during training. The proposed method can simultaneously learn the sparsified structure and the weights of a network using stochastic gradient descent.
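To illustrate the core idea, the PyTorch sketch below shows one way a gating variable can stay differentiable end to end while still reaching exactly zero during training; the class name, the exp/softplus parameterization, and the initialization are illustrative assumptions and not the paper's exact formulation.

import torch
import torch.nn as nn

class DifferentiableSparseGate(nn.Module):
    # Illustrative sketch (not the paper's formulation): each component i has a
    # positive score exp(alpha_i) that is compared against a learnable threshold
    # softplus(beta). The ReLU lets a gate become exactly zero, while the whole
    # expression remains (sub)differentiable, so it trains with plain SGD.
    def __init__(self, num_components: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_components))
        # Start with a small threshold so all gates are open initially.
        self.beta = nn.Parameter(torch.full((1,), -5.0))

    def forward(self) -> torch.Tensor:
        scores = torch.exp(self.alpha)                 # positive importance scores
        threshold = nn.functional.softplus(self.beta)  # learnable, non-negative threshold
        return torch.relu(scores - threshold)          # exact zeros once a score drops below the threshold

# Usage sketch: gate the output channels of a convolution and encourage
# sparsity with an L1 penalty on the gate values.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
gate = DifferentiableSparseGate(num_components=16)
x = torch.randn(8, 3, 32, 32)
g = gate()                              # shape (16,)
y = conv(x) * g.view(1, -1, 1, 1)       # channels with g_i == 0 are effectively pruned
sparsity_loss = 1e-3 * g.abs().sum()    # drives more gates to exactly zero during training

Because the zeroing happens inside the forward computation rather than in a separate selection or thresholding step, the sparsified structure and the remaining weights are learned jointly by the same gradient updates.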