Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x \in \mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a \in \mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in \mathbb{R}^{d \times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with a polynomial number of samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1/d)$ for this target function.
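The following is a minimal sketch, not the paper's construction, of the setting described in the abstract: Gaussian inputs $x \sim \mathcal{N}(0, I_d)$, labels $f^{\star}(x) = a^{\top}|W^{\star}x|$ with a nonnegative $a$ and an orthonormal $W^{\star}$, and an over-parametrized two-layer ReLU student trained by plain gradient descent from random initialization. The dimension, width, step size, step count, and sample sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10          # input dimension (illustrative)
n_train = 5000  # number of training samples (illustrative, polynomial in d)
m = 200         # hidden width >> d, i.e. over-parametrization (illustrative)
lr = 0.05       # gradient descent step size (illustrative)
steps = 2000    # number of gradient descent steps (illustrative)

# Ground truth: nonnegative vector a and orthonormal matrix W* (random orthogonal).
a_star = np.abs(rng.normal(size=d))
W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))

def target(X):
    # f*(x) = a^T |W* x|, evaluated row-wise for a batch X of shape (n, d).
    return np.abs(X @ W_star.T) @ a_star

# Training data: Gaussian inputs labeled by the ground truth network.
X = rng.normal(size=(n_train, d))
y = target(X)

# Over-parametrized two-layer ReLU student with random initialization.
W = rng.normal(size=(m, d)) / np.sqrt(d)
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

for t in range(steps):
    pre = X @ W.T                      # (n, m) pre-activations
    act = np.maximum(pre, 0.0)         # ReLU
    pred = act @ v                     # network outputs
    resid = pred - y                   # residuals
    # Gradients of the empirical loss (1/2) * mean squared error.
    grad_v = act.T @ resid / n_train
    grad_W = ((resid[:, None] * (pre > 0)) * v).T @ X / n_train
    v -= lr * grad_v
    W -= lr * grad_W

# Estimate the population loss on fresh Gaussian samples.
X_test = rng.normal(size=(20000, d))
test_loss = np.mean((np.maximum(X_test @ W.T, 0.0) @ v - target(X_test)) ** 2)
print(f"estimated population loss: {test_loss:.4f}")
```

The paper's separation result concerns the population loss this kind of gradient-descent-trained network can reach versus the loss floor of any kernel method (including the Neural Tangent Kernel) with polynomially many samples; the sketch above only instantiates the data model and the training procedure, not the kernel baseline or the theoretical guarantees.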