Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Neural networks have achieved great success in many machine learning applications, but the fundamental learning theory behind them remains largely unsolved. Learning neural networks is NP-hard, yet in practice simple algorithms such as stochastic gradient descent (SGD) often produce good solutions. Moreover, it is observed that overparameterization (that is, designing networks whose number of parameters is larger than statistically needed to perfectly fit the training data) improves both optimization and generalization, appearing to contradict traditional learning theory. In this work, we prove that overparameterized neural networks with rectified linear units (ReLU) can (improperly) learn some notable hypothesis classes, including two- and three-layer neural networks with fewer parameters and smooth activations. Moreover, the learning can be done simply by SGD or its variants, in polynomial time and using polynomially many samples. We also show that, for a fixed sample size, the population risk of the solution found by some SGD variant can be made almost independent of the number of parameters in the overparameterized network.
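The setting described above can be illustrated with a small numerical sketch: data labeled by a small two-layer target network with a smooth activation (tanh here), improperly fit by a much wider two-layer ReLU learner trained with plain SGD. This is a minimal illustration only, not the paper's construction or proof; the widths, tanh target, fixed second layer, step size, and squared loss are assumptions chosen for the demo.

```python
# Minimal sketch: an overparameterized ReLU network trained with SGD on data
# generated by a small, smooth two-layer target network. All hyperparameters
# below are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

d, k_target, m_learner = 20, 5, 2000      # input dim, target width, (much larger) learner width
n_train, n_test = 1000, 1000

# Small smooth target network: f*(x) = a* . tanh(W* x)
W_star = rng.normal(size=(k_target, d)) / np.sqrt(d)
a_star = rng.normal(size=k_target) / np.sqrt(k_target)

def target(X):
    return np.tanh(X @ W_star.T) @ a_star

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train, y_test = target(X_train), target(X_test)

# Overparameterized ReLU learner: f(x) = a . relu(W x).
# The output layer a is fixed at random (a common simplification); only W is trained.
W = rng.normal(size=(m_learner, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m_learner) / np.sqrt(m_learner)

def learner(X, W):
    return np.maximum(X @ W.T, 0.0) @ a

lr, epochs, batch = 0.2, 50, 32
for epoch in range(epochs):
    perm = rng.permutation(n_train)
    for start in range(0, n_train, batch):
        idx = perm[start:start + batch]
        Xb, yb = X_train[idx], y_train[idx]
        pre = Xb @ W.T                       # pre-activations, shape (batch, m)
        resid = np.maximum(pre, 0.0) @ a - yb
        # Gradient of 0.5 * mean squared error with respect to the first layer W.
        grad_W = ((resid[:, None] * a[None, :]) * (pre > 0)).T @ Xb / len(idx)
        W -= lr * grad_W

train_mse = np.mean((learner(X_train, W) - y_train) ** 2)
test_mse = np.mean((learner(X_test, W) - y_test) ** 2)
print(f"train MSE: {train_mse:.4f}  test MSE: {test_mse:.4f}")
```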
View on arXiv