
A Convergence Theory for Deep Learning via Over-Parameterization

9 November 2018
Zeyuan Allen-Zhu
Yuanzhi Li
Zhao Song
arXiv:1811.03962 (abs | PDF | HTML)
Abstract

Deep neural networks (DNNs) have demonstrated dominant performance in many fields; since AlexNet, the networks used in practice have grown wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks with only one hidden layer; the theory of multi-layer networks remains somewhat unsettled. In this work, we prove why simple algorithms such as stochastic gradient descent (SGD) can find global minima of the training objective of DNNs in polynomial time. We make only two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in $L$, the number of DNN layers, and in $n$, the number of training samples. As concrete examples, starting from randomly initialized weights, we show that on the training set SGD attains 100% accuracy in classification tasks, or minimizes regression loss at a linear convergence rate $\varepsilon \propto e^{-\Omega(T)}$, with a number of iterations that scales only polynomially in $n$ and $L$. Our theory applies to the widely used but non-smooth ReLU activation and to any smooth, possibly non-convex loss function. In terms of network architectures, our theory applies at least to fully-connected neural networks, convolutional neural networks (CNNs), and residual networks (ResNets).
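
As a rough, self-contained illustration of the regime the abstract describes (not the authors' construction or constants), the sketch below trains an over-parameterized fully-connected ReLU network with plain full-batch gradient descent, a special case of the (S)GD the paper analyzes, on a small random regression set with non-degenerate, unit-norm inputs. The width m, depth, learning rate, and data here are arbitrary illustrative choices, not the polynomial bounds from the paper.

# Illustrative sketch only: over-parameterized ReLU MLP trained with
# full-batch gradient descent on a small random regression set.
# Width, depth, learning rate, and data are demonstration choices,
# not the constants or rates proved in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, L, m = 20, 10, 3, 2048            # n samples, input dim, hidden layers, width
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)     # non-degenerate, unit-norm inputs
y = torch.randn(n, 1)                   # random regression targets

# Over-parameterized fully-connected ReLU network, randomly initialized
# (PyTorch's default initialization).
layers = [nn.Linear(d, m), nn.ReLU()]
for _ in range(L - 1):
    layers += [nn.Linear(m, m), nn.ReLU()]
layers += [nn.Linear(m, 1)]
net = nn.Sequential(*layers)

opt = torch.optim.SGD(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for t in range(500):
    opt.zero_grad()
    loss = loss_fn(net(X), y)           # training objective on the full training set
    loss.backward()
    opt.step()
    if t % 100 == 0:
        print(f"iter {t:4d}  training loss {loss.item():.3e}")

The paper's guarantee is that once the width is a sufficiently large polynomial in $n$ and $L$ and the step size is small enough, a training loss like the one printed above decays geometrically, $\varepsilon \propto e^{-\Omega(T)}$; this toy-scale run only mimics that setting and is not a verification of the theorem.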
