Learning in Compact Spaces with Approximately Normalized Transformers

28 May 2025
Jörg Franke
Urs Spiegelhalter
Marianna Nezhurina
Jenia Jitsev
Frank Hutter
Michael Hefenbrock
Main: 9 pages · Appendix: 11 pages · Bibliography: 3 pages · 14 figures · 5 tables
Abstract

In deep learning, regularization and normalization are common solutions for challenges such as overfitting, numerical instabilities, and the increasing variance in the residual stream. An alternative approach is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic but approximate normalization (anTransformer). Our approach constrains the norm of parameters and normalizes all representations via scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. When applied to GPT training, we observe a 40% faster convergence compared to models with QK normalization, with less than 3% additional runtime. Deriving scaling laws for anGPT, we found our method enables training with larger batch sizes and fewer hyperparameters, while matching the favorable scaling characteristics of classic GPT architectures.
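To make the idea in the abstract concrete, the sketch below illustrates one plausible reading of it; this is not the authors' released code, and the function names and the row-wise weight constraint are assumptions. The key point it demonstrates is that, because the norm of a high-dimensional vector with i.i.d. zero-mean, unit-variance entries concentrates tightly around sqrt(d), dividing by the fixed scalar sqrt(d) approximately normalizes a representation without computing per-token norms.

    import torch

    def exact_normalize(x: torch.Tensor) -> torch.Tensor:
        # Exact projection onto the unit hypersphere: needs one norm per vector.
        return x / x.norm(dim=-1, keepdim=True)

    def approx_normalize(x: torch.Tensor) -> torch.Tensor:
        # Approximate normalization by a single scalar: for x with i.i.d.
        # zero-mean, unit-variance entries, ||x|| concentrates around sqrt(d),
        # so dividing by sqrt(d) keeps the representation close to unit norm.
        d = x.shape[-1]
        return x / d**0.5

    @torch.no_grad()
    def constrain_weight_norm(weight: torch.Tensor, target: float = 1.0) -> None:
        # Hypothetical parameter-norm constraint: after an optimizer step,
        # rescale each weight row back to a fixed norm so parameters stay
        # (approximately) on a hypersphere of radius `target`.
        weight.mul_(target / weight.norm(dim=-1, keepdim=True))

    x = torch.randn(8, 4096)                 # high-dimensional activations
    print(exact_normalize(x).norm(dim=-1))   # exactly 1 (up to float precision)
    print(approx_normalize(x).norm(dim=-1))  # close to 1, by norm concentration

The approximate variant trades an exact unit norm for a cheap scalar multiplication, which is consistent with the reported overhead of under 3% additional runtime; the exact details of where these operations are applied in the Transformer are given in the paper, not here.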

View on arXiv: https://arxiv.org/abs/2505.22014
@article{franke2025_2505.22014,
  title={Learning in Compact Spaces with Approximately Normalized Transformers},
  author={Jörg K.H. Franke and Urs Spiegelhalter and Marianna Nezhurina and Jenia Jitsev and Frank Hutter and Michael Hefenbrock},
  journal={arXiv preprint arXiv:2505.22014},
  year={2025}
}