Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, the neural networks used in practice have been going wider and deeper. On the theoretical side, a long line of works has focused on why we can train neural networks when there is only one hidden layer. The theory of multi-layer networks remains somewhat unsettled.

In this work, we prove why simple algorithms such as stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We make only two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in $L$, the number of DNN layers, and in $n$, the number of training samples.

As concrete examples, on the training set and starting from randomly initialized weights, we show that SGD attains 100% accuracy in classification tasks, or minimizes regression loss at linear convergence speed $\varepsilon \propto e^{-\Omega(T)}$, with a number of iterations that scales only polynomially in $n$ and $L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
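Below is a minimal NumPy sketch (not code from the paper) that illustrates the regime the abstract describes: plain SGD from random initialization on an over-parameterized multi-layer ReLU network drives the training loss of a small regression task toward zero. The width, step size, sample count, and the choice to keep the output layer fixed are illustrative assumptions, not quantities taken from the theorem statements.

```python
# Illustrative sketch only: SGD on an over-parameterized ReLU network (hidden
# width m much larger than the number of samples n). All hyperparameters are
# assumptions chosen for the demo, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 1000                           # n samples, input dim d, hidden width m >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm, non-degenerate inputs
y = rng.standard_normal(n)                       # arbitrary real-valued targets

# Two hidden ReLU layers with He-style random initialization.
W1 = rng.standard_normal((d, m)) * np.sqrt(2.0 / d)
W2 = rng.standard_normal((m, m)) * np.sqrt(2.0 / m)
a = rng.standard_normal(m) * np.sqrt(1.0 / m)    # output layer kept fixed (a simplification for this sketch)

lr = 0.01
for step in range(3001):
    i = rng.integers(n)                          # SGD: one random training sample per step
    x, target = X[i], y[i]
    h1 = np.maximum(x @ W1, 0.0)                 # forward pass
    h2 = np.maximum(h1 @ W2, 0.0)
    pred = h2 @ a
    err = pred - target                          # gradient of 0.5 * (pred - target)^2 w.r.t. pred

    g_h2 = err * a * (h2 > 0)                    # backprop through the ReLU layers
    g_W2 = np.outer(h1, g_h2)
    g_h1 = (W2 @ g_h2) * (h1 > 0)
    g_W1 = np.outer(x, g_h1)

    W1 -= lr * g_W1                              # SGD update on the hidden layers
    W2 -= lr * g_W2

    if step % 300 == 0:
        preds = np.maximum(np.maximum(X @ W1, 0.0) @ W2, 0.0) @ a
        print(f"step {step:4d}  train MSE = {np.mean((preds - y) ** 2):.6f}")
```

Under these illustrative assumptions the printed training MSE should shrink roughly geometrically with the number of steps, mirroring the linear convergence rate claimed above.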