Global Convergence of Gradient Descent for Deep Linear Residual Networks

Abstract

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding the stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an ε-optimal point in O(L^3 log(1/ε)) iterations, which scales polynomially with the network depth L. Together with the exp(Ω(L)) convergence time for standard initializations (Xavier or near-identity) [Shamir, 2018], our result demonstrates the importance of both the residual structure and the initialization in optimizing deep linear neural networks, especially when L is large.
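To make the setup concrete, below is a minimal sketch of what a ZAS-style initialization could look like, assuming the deep linear residual network takes the form f(x) = W_out (I + W_L) ··· (I + W_1) x and that every residual block W_l and the output layer W_out start at zero; this parameterization and the function names are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def zas_init(d, L):
    """Sketch of a zero-asymmetric (ZAS)-style initialization (assumed form).

    All residual blocks W_l are zero, so each layer (I + W_l) starts as the
    identity, and the extra output layer W_out is also zero, breaking the
    symmetry of a pure near-identity initialization.
    """
    residual_blocks = [np.zeros((d, d)) for _ in range(L)]  # I + W_l = I at init
    w_out = np.zeros((d, d))                                # zero output layer
    return residual_blocks, w_out

def forward(x, residual_blocks, w_out):
    """Apply the linear residual network to a batch of column vectors x (d x n)."""
    h = x
    for W in residual_blocks:
        h = h + W @ h  # (I + W_l) h
    return w_out @ h

# Usage: a depth-20 network on 4-dimensional inputs outputs zero at initialization.
blocks, w_out = zas_init(d=4, L=20)
x = np.random.randn(4, 8)
print(np.allclose(forward(x, blocks, w_out), 0.0))
```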
