Convergence of Alternating Gradient Descent for Matrix Factorization

Abstract

We consider alternating gradient descent (AGD) with fixed step size $\eta > 0$, applied to the asymmetric matrix factorization objective. We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $T = C \left( \left(\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})}\right)^2 \log(1/\epsilon)\right)$ iterations of alternating gradient descent suffice to reach an $\epsilon$-optimal factorization $\| \mathbf{A} - \mathbf{X}_T^{\vphantom{\intercal}} \mathbf{Y}_T^{\intercal} \|_{\rm F}^2 \leq \epsilon \| \mathbf{A} \|_{\rm F}^2$ with high probability, starting from an atypical random initialization. The factors have rank $d > r$, so that $\mathbf{X}_T \in \mathbb{R}^{m \times d}$ and $\mathbf{Y}_T \in \mathbb{R}^{n \times d}$. Experiments suggest that our proposed initialization is not merely of theoretical benefit, but significantly improves convergence of gradient descent in practice. Our proof is conceptually simple: a uniform Polyak-Łojasiewicz (PL) inequality and a uniform Lipschitz smoothness constant are guaranteed for a sufficient number of iterations, starting from our random initialization. Our proof method should be useful for extending and simplifying convergence analyses for a broader class of nonconvex low-rank factorization problems.
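To make the iteration concrete, here is a minimal NumPy sketch of AGD on the objective $f(\mathbf{X}, \mathbf{Y}) = \frac{1}{2}\| \mathbf{A} - \mathbf{X} \mathbf{Y}^{\intercal} \|_{\rm F}^2$ with overparameterized factors of rank $d > r$. The step size, the small-scale random initialization, and the stopping rule below are illustrative assumptions, not the constants or the atypical initialization prescribed by the paper.

```python
import numpy as np

# Minimal sketch of alternating gradient descent (AGD) for
# f(X, Y) = 0.5 * ||A - X @ Y.T||_F^2 with overparameterized factors
# (rank d > r). The step size, initialization scale, and stopping rule
# are illustrative assumptions, not the paper's prescriptions.

rng = np.random.default_rng(0)

m, n, r, d = 100, 80, 5, 10                    # factor rank d exceeds rank(A) = r
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target

# Small random initialization (placeholder for the paper's
# "atypical" random initialization).
X = rng.standard_normal((m, d)) / np.sqrt(m * d)
Y = rng.standard_normal((n, d)) / np.sqrt(n * d)

eta = 0.2 / np.linalg.norm(A, 2)               # fixed step size ~ 1 / sigma_1(A)
eps = 1e-6
A_norm2 = np.linalg.norm(A, "fro") ** 2

for t in range(5000):
    R = A - X @ Y.T                            # residual with current factors
    X = X + eta * R @ Y                        # gradient step in X, Y held fixed
    R = A - X @ Y.T                            # refresh residual (alternation)
    Y = Y + eta * R.T @ X                      # gradient step in Y, X held fixed
    if np.linalg.norm(A - X @ Y.T, "fro") ** 2 <= eps * A_norm2:
        print(f"reached eps-optimal factorization at iteration {t}")
        break
```

Note the alternation: the update to $\mathbf{Y}$ uses the freshly updated $\mathbf{X}$, which is what distinguishes AGD from simultaneous gradient descent on the pair $(\mathbf{X}, \mathbf{Y})$.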
