  3. 1802.06093
Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

16 February 2018
Peter L. Bartlett
D. Helmbold
Philip M. Long
Abstract

We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \ldots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = \ldots = \Theta_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the least squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $\Phi$ satisfies $u^{\top} \Phi u > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} \Theta_L \Theta_{L-1} \ldots \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, \ldots, \Theta_L$ so that they have the same singular values.
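To make the setting concrete, the sketch below runs plain gradient descent from the identity initialization on a deep linear network. It is an illustrative implementation, not the paper's own algorithm (in particular it omits the regularizers discussed above); it assumes isotropic inputs, under which the population quadratic loss reduces to the Frobenius objective $\frac{1}{2}\|\Theta_L \ldots \Theta_1 - \Phi\|_F^2$. The function name `identity_init_gd` and the parameters `eta`, `steps`, and `tol` are hypothetical choices made for the example.

```python
import numpy as np

def identity_init_gd(Phi, L=3, eta=1e-2, steps=10_000, tol=1e-8):
    # Gradient descent on (1/2)||Theta_L ... Theta_1 - Phi||_F^2,
    # starting from Theta_1 = ... = Theta_L = I (identity initialization).
    d = Phi.shape[0]
    Thetas = [np.eye(d) for _ in range(L)]
    for _ in range(steps):
        # prefix[i] = Theta_i ... Theta_1, with prefix[0] = I
        prefix = [np.eye(d)]
        for Th in Thetas:
            prefix.append(Th @ prefix[-1])
        # suffix[i] = Theta_L ... Theta_{i+1}, with suffix[L] = I
        suffix = [np.eye(d)]
        for Th in reversed(Thetas):
            suffix.append(suffix[-1] @ Th)
        suffix = suffix[::-1]
        residual = prefix[-1] - Phi          # end-to-end product minus target
        if np.linalg.norm(residual) ** 2 <= tol:
            break
        # dLoss/dTheta_i = (Theta_L ... Theta_{i+1})^T (P - Phi) (Theta_{i-1} ... Theta_1)^T
        grads = [suffix[i + 1].T @ residual @ prefix[i].T for i in range(L)]
        Thetas = [Th - eta * g for Th, g in zip(Thetas, grads)]
    return Thetas

# Example: a symmetric positive definite target close to the identity,
# the regime in which the paper proves convergence of this initialization.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Phi = np.eye(5) + 0.1 * (A + A.T)
Thetas = identity_init_gd(Phi)
product = np.linalg.multi_dot(Thetas[::-1])  # Theta_L ... Theta_1
print(np.linalg.norm(product - Phi))         # small if gradient descent converged
```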
