Generalization Bounds for Neural Networks via Approximate Description Length

13 October 2019
Amit Daniely
Elad Granot
Abstract

We investigate the sample complexity of networks with bounds on the magnitude of their weights. In particular, we consider the class \[ H=\left\{W_t\circ\rho\circ \ldots\circ\rho\circ W_{1} : W_1,\ldots,W_{t-1}\in M_{d,d},\; W_t\in M_{1,d}\right\} \] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\rho$ is the sigmoid function $\frac{e^x}{1+e^x}$ or the smoothened ReLU function $\ln(1+e^x)$. We show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $H$ is $\tilde O\left(\frac{dR^2}{\epsilon^2}\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\tilde O\left(\frac{d^2R^2}{\epsilon^2}\right)$. We furthermore show that this bound remains valid if instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where the $W_i^0$ are some reference matrices with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters in many typical regimes of parameters. To establish our results we develop a new technique to analyze the sample complexity of families $H$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f:X\to\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $H$ using $d$ bits, then $d/\epsilon^2$ examples suffice to guarantee uniform convergence; namely, the empirical loss of all the functions in the class is $\epsilon$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.
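The class $H$ described above is easy to instantiate concretely. The following is a minimal NumPy sketch, not taken from the paper's code: the helper names (`sample_network`, `forward`, `smoothed_relu`) are illustrative assumptions. It draws a depth-$t$ network whose layers are rescaled to satisfy the $O(1)$ spectral-norm constraint, evaluates it on an input in $[-1,1]^d$, and prints the $dR^2/\epsilon^2$ sample-complexity estimate from the abstract with log factors dropped.

```python
import numpy as np

def smoothed_relu(x):
    """The smoothened ReLU rho(x) = ln(1 + e^x) from the abstract."""
    return np.log1p(np.exp(x))

def sample_network(d, t, spectral_bound=1.0, rng=None):
    """Draw W_1,...,W_{t-1} in M_{d,d} and W_t in M_{1,d},
    rescaled so every spectral norm equals `spectral_bound` (i.e. O(1))."""
    rng = np.random.default_rng() if rng is None else rng
    shapes = [(d, d)] * (t - 1) + [(1, d)]
    weights = []
    for shape in shapes:
        W = rng.standard_normal(shape)
        W *= spectral_bound / np.linalg.norm(W, ord=2)  # spectral norm bound
        weights.append(W)
    return weights

def forward(weights, x):
    """Apply W_t o rho o ... o rho o W_1 to an input x in [-1, 1]^d."""
    h = x
    for W in weights[:-1]:
        h = smoothed_relu(W @ h)
    return weights[-1] @ h

d, t, eps = 100, 5, 0.1
weights = sample_network(d, t)
R = max(np.linalg.norm(W) for W in weights)  # Frobenius-norm bound R
x = np.random.uniform(-1.0, 1.0, size=d)
print("network output:", forward(weights, x))
# Sample complexity from the abstract, ignoring log factors: ~ d * R^2 / eps^2
print("sample complexity estimate:", d * R**2 / eps**2)
```

Since the layers are rescaled to unit spectral norm, their Frobenius norms are at most $\sqrt{d}$, so in this toy setting the estimate scales roughly like $d^2/\epsilon^2$, compared with $d^3/\epsilon^2$ under the earlier $\tilde O\left(\frac{d^2R^2}{\epsilon^2}\right)$ bound.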
