Sparse maximal update parameterization: A holistic approach to sparse training dynamics

24 May 2024
Nolan Dey
Shane Bergsma
Joel Hestness
arXiv: 2405.15743 · PDF · HTML
Abstract

Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the learning HPs originally crafted for dense models. Unfortunately, we show sparse and dense networks do not share the same optimal HPs. Without stable dynamics and effective training recipes, it is costly to test sparsity at scale, which is key to surpassing dense networks and making the business case for sparsity acceleration in hardware. A holistic approach is needed to tackle these challenges and we propose SμPar as one such approach. SμPar ensures activations, gradients, and weight updates all scale independently of sparsity level. Further, by reparameterizing the HPs, SμPar enables the same HP values to be optimal as we vary both sparsity level and model width. HPs can be tuned on small dense networks and transferred to large sparse models, greatly reducing tuning costs. On large-scale language modeling, SμPar training improves loss by up to 8.2% over the common approach of using the dense model standard parameterization.
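The abstract describes the reparameterization only at a high level; the exact SμPar scaling rules are given in the paper. The Python sketch below is purely illustrative, not the authors' actual rule set: it assumes a μP-style scheme in which per-layer initialization variance and learning rate are corrected by the effective (density-weighted) fan-in, so that HPs tuned on a small dense proxy can be reused as width and sparsity vary. The `base_*` values and the function name are hypothetical.

```python
import numpy as np

def smupar_like_scales(fan_in, density, base_fan_in=256,
                       base_init_std=0.02, base_lr=6e-4):
    """Illustrative sketch (assumption, not the paper's exact rules).

    Assumes activations are kept width- and sparsity-invariant by scaling
    initialization variance as 1/(fan_in * density), and updates are kept
    invariant by scaling the learning rate as base_fan_in/(fan_in * density).
    The base_* values stand in for HPs tuned on a small dense proxy model.
    """
    eff_fan_in = fan_in * density  # expected number of nonzero incoming weights
    init_std = base_init_std * np.sqrt(base_fan_in / eff_fan_in)
    lr = base_lr * (base_fan_in / eff_fan_in)
    return init_std, lr

# Example: the tuned base HPs stay fixed; only the derived per-layer scales
# change as the model is widened or sparsified.
for width, density in [(256, 1.0), (4096, 1.0), (4096, 0.25)]:
    std, lr = smupar_like_scales(width, density)
    print(f"width={width:5d} density={density:.2f} -> init_std={std:.4f}, lr={lr:.2e}")
```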
