Linearized two-layers neural networks in high dimension

27 April 2019
Behrooz Ghorbani
Song Mei
Theodor Misiakiewicz
Andrea Montanari
Abstract

We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i, \boldsymbol{x}_i)\}_{i \le n}$ where $\boldsymbol{x}_i$ is a feature vector uniformly distributed on the sphere and $y_i = f_{\star}(\boldsymbol{x}_i) + \varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF) and the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n = \infty$ while $d$ and $N$ are large but finite; and the sample-size-limited regime, in which $N = \infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N \le d^{\ell + 1 - \delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples satisfies $d^{\ell + \delta} \le n \le d^{\ell + 1 - \delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression, and the optimal prediction error is attained with vanishing ridge regularization.
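To make the two linearized models concrete, the following minimal sketch (not taken from the paper) fits both the RF model and the NT model by ridge regression in their respective feature spaces. The ReLU activation, the degree-2 target, the tiny ridge parameter `lam`, and all variable names are illustrative assumptions, not choices prescribed by the authors; the NT features are the gradients of the network output with respect to the first-layer weights at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 20, 100, 1000      # dimension, neurons, training samples (illustrative sizes)
lam = 1e-6                   # small ridge penalty; the abstract notes vanishing ridge is optimal

def sphere(m, d):
    """m i.i.d. points uniformly distributed on the unit sphere in R^d."""
    z = rng.standard_normal((m, d))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Synthetic target: a degree-2 polynomial in the raw features (an assumption for this demo).
f_star = lambda x: np.sqrt(d) * x[:, 0] * x[:, 1]

X, X_test = sphere(n, d), sphere(2000, d)
y = f_star(X) + 0.1 * rng.standard_normal(n)

W = sphere(N, d)             # random first-layer weights, frozen at initialization
Z, Z_test = X @ W.T, X_test @ W.T

# RF features: sigma(<w_j, x>).  NT features: gradient of the output w.r.t. w_j,
# i.e. sigma'(<w_j, x>) * x.  ReLU is used here purely for concreteness.
Phi_RF, Phi_RF_test = np.maximum(Z, 0.0), np.maximum(Z_test, 0.0)
Phi_NT = ((Z > 0).astype(float)[:, :, None] * X[:, None, :]).reshape(n, N * d)
Phi_NT_test = ((Z_test > 0).astype(float)[:, :, None] * X_test[:, None, :]).reshape(len(X_test), N * d)

def ridge(Phi, y, Phi_test, lam):
    """Ridge regression in feature space: a = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    p = Phi.shape[1]
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    return Phi_test @ a

for name, Phi_tr, Phi_te in [("RF", Phi_RF, Phi_RF_test), ("NT", Phi_NT, Phi_NT_test)]:
    pred = ridge(Phi_tr, y, Phi_te, lam)
    mse = float(np.mean((pred - f_star(X_test)) ** 2))
    print(f"{name} test MSE: {mse:.3f}")
```

Note that the NT feature map has $Nd$ coordinates versus $N$ for RF, which is one informal way to see why, at a comparable number of neurons, NT can capture one higher polynomial degree in the regime studied by the paper.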
