Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK

Annual Conference Computational Learning Theory (COLT), 2020
Abstract

We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x \in \mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a \in \mathbb{R}^d$ is a nonnegative vector and $W^{\star} \in \mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with a polynomial number of samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1/d)$.
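The setup in the abstract is concrete enough to simulate. The sketch below is a minimal numpy illustration, not the paper's analyzed algorithm or its rates: it draws Gaussian inputs, labels them with a teacher $f^{\star}(x) = a^{\top}|W^{\star}x|$ (nonnegative $a$, orthonormal $W^{\star}$), and runs plain gradient descent on an over-parametrized two-layer ReLU student. The width, step size, and step count are arbitrary assumptions chosen for a quick run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 200, 4000          # input dim, student width (m >> d), training samples

# Teacher f*(x) = a^T |W* x| with a >= 0 and W* orthonormal (illustrative draw)
a_star = rng.random(d)                             # nonnegative second-layer weights
W_star, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthonormal matrix

def teacher(X):
    return np.abs(X @ W_star.T) @ a_star           # |W* x| applied row-wise

# Gaussian inputs labeled by the teacher
X = rng.normal(size=(n, d))
y = teacher(X)
X_test = rng.normal(size=(2000, d))
y_test = teacher(X_test)

# Over-parametrized student f(x) = b^T ReLU(W x), random initialization
W = rng.normal(size=(m, d)) / np.sqrt(d)
b = rng.normal(size=m) / np.sqrt(m)

lr, steps = 0.05, 2000                             # illustrative hyperparameters
for t in range(steps):
    pre = X @ W.T                                  # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                     # ReLU
    err = act @ b - y                              # residuals of loss 0.5 * mean(err^2)
    grad_b = act.T @ err / n
    grad_W = ((err[:, None] * (pre > 0) * b).T @ X) / n
    b -= lr * grad_b
    W -= lr * grad_W

test_pred = np.maximum(X_test @ W.T, 0.0) @ b
print("test MSE:", np.mean((test_pred - y_test) ** 2))
```

Note that $|z| = \mathrm{ReLU}(z) + \mathrm{ReLU}(-z)$, so the teacher is itself exactly representable by a two-layer ReLU network of width $2d$; the sketch only demonstrates the data model and training loop, not the paper's $o(1/d)$ guarantee.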
