The Optimization Landscape of SGD Across the Feature Learning Strength

6 October 2024
Alexander B. Atanasov
Alexandru Meterez
James B. Simon
Cengiz Pehlevan
Abstract

We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter γ. Recent work has identified γ as controlling the strength of feature learning. As γ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model. We find that the optimal learning rate η* scales non-trivially with γ. In particular, η* ∝ γ^2 when γ ≪ 1 and η* ∝ γ^{2/L} when γ ≫ 1 for a feed-forward network of depth L. Using this optimal learning-rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" γ ≫ 1 regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks with different large γ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large γ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-γ limit may yield useful insights into the dynamics of representation learning in performant models.
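As a rough illustration of the setup described in the abstract, the sketch below builds a depth-L feed-forward network whose output is divided by the feature-learning strength γ, together with a helper that applies the reported learning-rate scaling η* ∝ γ^2 for γ ≪ 1 and η* ∝ γ^{2/L} for γ ≫ 1. The layer widths, activation, base learning rate eta_base, and the crossover at γ = 1 are illustrative assumptions, not the paper's exact parameterization.

import torch
import torch.nn as nn

class GammaScaledMLP(nn.Module):
    """Depth-L feed-forward network whose output is down-scaled by gamma.

    Minimal sketch: larger gamma pushes training from lazy (kernel-like)
    dynamics toward rich feature learning. Widths/activation are assumed;
    depth >= 2 is expected.
    """
    def __init__(self, d_in, d_hidden, d_out, depth, gamma):
        super().__init__()
        self.gamma = gamma
        layers = [nn.Linear(d_in, d_hidden), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(d_hidden, d_hidden), nn.ReLU()]
        layers.append(nn.Linear(d_hidden, d_out))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Final output divided by the feature-learning strength gamma.
        return self.net(x) / self.gamma


def suggested_lr(gamma, depth, eta_base=1e-3):
    """Heuristic learning-rate scaling from the abstract:
    eta* ~ gamma^2 for gamma << 1 and eta* ~ gamma^(2/depth) for gamma >> 1.
    The prefactor eta_base and the crossover at gamma = 1 are assumptions.
    """
    if gamma <= 1.0:
        return eta_base * gamma ** 2
    return eta_base * gamma ** (2.0 / depth)

For example, model = GammaScaledMLP(784, 512, 10, depth=4, gamma=10.0) could be trained in the "ultra-rich" regime with torch.optim.SGD(model.parameters(), lr=suggested_lr(10.0, depth=4)).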

@article{atanasov2025_2410.04642,
  title={The Optimization Landscape of SGD Across the Feature Learning Strength},
  author={Alexander Atanasov and Alexandru Meterez and James B. Simon and Cengiz Pehlevan},
  journal={arXiv preprint arXiv:2410.04642},
  year={2025}
}