
Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

Abstract

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x \in \mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a \in \mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in \mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with a polynomial number of samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1/d)$.
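For concreteness, the sketch below illustrates the setting described in the abstract: Gaussian inputs labeled by $f^{\star}(x) = a^{\top}|W^{\star}x|$ with nonnegative $a$ and orthonormal $W^{\star}$, and an over-parametrized two-layer ReLU student trained by plain gradient descent on the squared loss. This is a minimal illustration only, not the paper's algorithm or experimental code; the dimension, width, learning rate, step count, and the choice to keep the second layer fixed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 20       # input dimension (illustrative)
m = 200      # student width, over-parametrized: m >> d (illustrative)
n = 5000     # number of training samples (illustrative)
lr = 0.05    # learning rate (illustrative)
steps = 2000

# Ground truth: f*(x) = a^T |W* x|, with a nonnegative and W* orthonormal.
a_star = np.abs(rng.normal(size=d)) / d            # nonnegative second-layer vector
W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthonormal first-layer matrix

def f_star(X):
    # X has shape (n, d); returns labels of shape (n,).
    return np.abs(X @ W_star.T) @ a_star

# Gaussian inputs and their labels.
X = rng.normal(size=(n, d))
y = f_star(X)

# Over-parametrized two-layer ReLU student: f(x) = sum_j b_j * relu(w_j^T x).
W = rng.normal(size=(m, d)) / np.sqrt(d)   # random initialization of first layer
b = rng.choice([-1.0, 1.0], size=m) / m    # second layer (kept fixed here for simplicity)

def relu(z):
    return np.maximum(z, 0.0)

# Plain gradient descent on the empirical squared loss, over the first layer.
for t in range(steps):
    pre = X @ W.T            # (n, m) pre-activations
    pred = relu(pre) @ b     # (n,) student predictions
    err = pred - y           # (n,) residuals
    # dL/dW_j = mean_i err_i * b_j * 1[pre_ij > 0] * x_i
    grad = ((err[:, None] * (pre > 0)) * b[None, :]).T @ X / n
    W -= lr * grad
    if t % 500 == 0:
        print(f"step {t}: train MSE = {np.mean(err ** 2):.6f}")
```

The point of the sketch is only to make the objects in the abstract concrete (the ground-truth network, the over-parametrized student, and the gradient-descent dynamics); the paper's results concern the population loss of this kind of training versus that of any kernel method, which the sketch does not attempt to reproduce.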
