arXiv:2006.11648

Training (Overparametrized) Neural Networks in Near-Linear Time

20 June 2020
Jan van den Brand
Binghui Peng
Zhao Song
Omri Weinstein
    ODL
Abstract

The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $m$, allowing us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra, which led to recent breakthroughs in convex optimization (ERM, LPs, regression), can be carried over to the realm of deep learning as well.
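
To make the sketch-to-precondition idea above concrete, here is a minimal, self-contained sketch (not the authors' implementation): it solves the Gauss-Newton system $(JJ^\top)\lambda = r$ by preconditioned conjugate gradient without ever forming the $n \times n$ Gram matrix, building the preconditioner from a QR factorization of a randomly sketched Jacobian. The Jacobian, residual, sketch size, and the dense Gaussian sketch (standing in for a Fast-JL transform) are all illustrative assumptions.

```python
# Illustrative sketch of "sketch-to-precondition + conjugate gradient":
# solve (J J^T) lam = r without forming J J^T, using a random sketch of J
# to build a cheap preconditioner. Synthetic data; a dense Gaussian sketch
# stands in for the Fast-JL transform used in the paper.
import numpy as np
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)

n, m = 200, 10_000                        # batch size n, network width m (m >> n)
row_scales = np.logspace(0, 2, n)[:, None]  # badly scaled rows -> ill-conditioned Gram
J = rng.standard_normal((n, m)) * row_scales / np.sqrt(m)   # stand-in Jacobian
r = rng.standard_normal(n)                                  # stand-in residual f(W) - y

# --- Sketch: B = J S^T with S an s x m random matrix, s = O(n) ---
# (A real Fast-JL / SRHT sketch would be applied in O(nm log m) time;
#  the Gaussian matrix here is only a stand-in.)
s = 4 * n
S = rng.standard_normal((s, m)) / np.sqrt(s)
B = J @ S.T                               # n x s, so B B^T approximates J J^T

# --- Preconditioner: B B^T = R^T R from an economic QR of B^T ---
_, R = qr(B.T, mode='economic')           # R is n x n upper triangular

def apply_preconditioner(v):
    # M v = R^{-1} R^{-T} v  ~  (J J^T)^{-1} v, two triangular solves in O(n^2)
    y = solve_triangular(R, v, trans='T', lower=False)
    return solve_triangular(R, y, lower=False)

def apply_gram(v):
    # (J J^T) v via two matrix-vector products, O(mn) per CG iteration
    return J @ (J.T @ v)

G = LinearOperator((n, n), matvec=apply_gram)
M = LinearOperator((n, n), matvec=apply_preconditioner)

lam, info = cg(G, r, M=M, maxiter=50)     # few iterations thanks to the preconditioner
step = J.T @ lam                          # minimum-norm Gauss-Newton step solving J g = r
print(info, np.linalg.norm(J @ step - r)) # info == 0 means CG converged
```

Under these assumptions, each CG matrix-vector product costs $O(mn)$ and the preconditioner is built in time independent of the width $m$ (given a fast sketch), which is the rough shape of the near-linear bound described in the abstract; this is a toy illustration, not a faithful reproduction of the paper's algorithm.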
