Learning in Compact Spaces with Approximately Normalized Transformers

28 May 2025
Jörg Franke
Urs Spiegelhalter
Marianna Nezhurina
Jenia Jitsev
Frank Hutter
Michael Hefenbrock
Main: 9 pages · Appendix: 11 pages · Bibliography: 3 pages · 14 figures · 5 tables
Abstract

In deep learning, regularization and normalization are common solutions for challenges such as overfitting, numerical instabilities, and the increasing variance in the residual stream. An alternative approach is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic but approximate normalization (anTransformer). Our approach constrains the norm of parameters and normalizes all representations via scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. When applied to GPT training, we observe a 40% faster convergence compared to models with QK normalization, with less than 3% additional runtime. Deriving scaling laws for anGPT, we found our method enables training with larger batch sizes and fewer hyperparameters, while matching the favorable scaling characteristics of classic GPT architectures.
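To make the idea in the abstract concrete, the sketch below illustrates one plausible reading of it; this is not the authors' released code, and the function names and the row-wise weight constraint are assumptions. The key point it demonstrates is that, because the norm of a high-dimensional vector with i.i.d. zero-mean, unit-variance entries concentrates tightly around sqrt(d), dividing by the fixed scalar sqrt(d) approximately normalizes a representation without computing per-token norms.

    import torch

    def exact_normalize(x: torch.Tensor) -> torch.Tensor:
        # Exact projection onto the unit hypersphere: needs one norm per vector.
        return x / x.norm(dim=-1, keepdim=True)

    def approx_normalize(x: torch.Tensor) -> torch.Tensor:
        # Approximate normalization by a single scalar: for x with i.i.d.
        # zero-mean, unit-variance entries, ||x|| concentrates around sqrt(d),
        # so dividing by sqrt(d) keeps the representation close to unit norm.
        d = x.shape[-1]
        return x / d**0.5

    @torch.no_grad()
    def constrain_weight_norm(weight: torch.Tensor, target: float = 1.0) -> None:
        # Hypothetical parameter-norm constraint: after an optimizer step,
        # rescale each weight row back to a fixed norm so parameters stay
        # (approximately) on a hypersphere of radius `target`.
        weight.mul_(target / weight.norm(dim=-1, keepdim=True))

    x = torch.randn(8, 4096)                 # high-dimensional activations
    print(exact_normalize(x).norm(dim=-1))   # exactly 1 (up to float precision)
    print(approx_normalize(x).norm(dim=-1))  # close to 1, by norm concentration

The approximate variant trades an exact unit norm for a cheap scalar multiplication, which is consistent with the reported overhead of under 3% additional runtime; the exact details of where these operations are applied in the Transformer are given in the paper, not here.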

View on arXiv: https://arxiv.org/abs/2505.22014
@article{franke2025_2505.22014,
  title={Learning in Compact Spaces with Approximately Normalized Transformers},
  author={Jörg K.H. Franke and Urs Spiegelhalter and Marianna Nezhurina and Jenia Jitsev and Frank Hutter and Michael Hefenbrock},
  journal={arXiv preprint arXiv:2505.22014},
  year={2025}
}