We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth $k$. We also show empirically that this phenomenon can occur in higher dimensions, where each $w_i$ is a matrix. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where $k$ is large.
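The scalar setting described above can be illustrated with a minimal sketch (not the paper's experiments): run gradient descent on $\frac{1}{2}(\prod_{i=1}^{k} w_i - 1)^2$, i.e. the choice $f(x) = \frac{1}{2}(x-1)^2$, from a standard Gaussian initialization, and count the iterations needed to reach a small objective value for increasing depth $k$. The learning rate, tolerance, and iteration budget below are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of gradient descent on 0.5 * (w_1 * ... * w_k - 1)^2
# with scalar parameters w_1, ..., w_k and standard Gaussian initialization.
import numpy as np

def iterations_to_converge(k, lr=1e-2, tol=1e-3, max_iters=10**7, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(k)            # depth-k scalar parameters
    for t in range(1, max_iters + 1):
        prod = np.prod(w)
        if abs(prod - 1.0) < tol:         # objective 0.5 * (prod - 1)^2 is small
            return t
        # d/dw_i of 0.5 * (prod - 1)^2 is (prod - 1) * prod_{j != i} w_j;
        # computed here as prod / w_i, valid since no w_i is exactly zero
        # under a generic random initialization.
        grads = (prod - 1.0) * prod / w
        w -= lr * grads
    return max_iters                      # did not converge within the budget

for k in [2, 4, 6, 8, 10]:
    print(k, iterations_to_converge(k))
```

Because the product of $k$ independent standard Gaussians is typically exponentially small in $k$, the gradients near initialization are tiny, and the iteration count printed above tends to grow sharply with depth, which is the qualitative behavior the abstract describes.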