Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization

16 March 2025

Abstract

Optimization with matrix gradient orthogonalization has recently demonstrated impressive results in the training of deep neural networks (Jordan et al., 2024; Liu et al., 2025). In this paper, we provide a theoretical analysis of this approach. In particular, we show that the orthogonalized gradient method can be seen as a first-order trust-region optimization method, where the trust-region is defined in terms of the matrix spectral norm. Motivated by this observation, we develop the stochastic non-Euclidean trust-region gradient method with momentum, which recovers the Muon optimizer (Jordan et al., 2024) as a special case, along with normalized SGD and signSGD with momentum (Cutkosky and Mehta, 2020; Sun et al., 2023). In addition, we prove state-of-the-art convergence results for the proposed algorithm in a range of scenarios, which involve arbitrary non-Euclidean norms, constrained and composite problems, and non-convex, star-convex, first- and second-order smooth functions. Finally, our theoretical findings provide an explanation for several practical observations, including the practical superiority of Muon compared to the Orthogonal-SGDM algorithm of Tuddenham et al. (2022) and the importance of weight decay in the training of large-scale language models.
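The trust-region view described in the abstract can be illustrated with a minimal NumPy sketch. It assumes an exact SVD and omits momentum and stochastic gradients, so it is not the paper's algorithm or the Muon implementation (practical implementations typically approximate the orthogonalization rather than computing an SVD). The function names `steepest_direction` and `trust_region_step` are hypothetical. The sketch minimizes the linear model `<grad, X - weights>` over a ball of the given radius around `weights`, for three choices of norm: the spectral norm yields the orthogonalized gradient, the Frobenius norm yields the normalized gradient, and the max-entry norm yields the sign of the gradient.

```python
import numpy as np


def steepest_direction(grad, norm):
    """Unit-norm direction D that maximizes <grad, D> for the chosen norm.

    Hypothetical helper for illustration only.
    """
    if norm == "spectral":
        # grad = U diag(s) V^T; the maximizer over the spectral ball is U V^T.
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return u @ vt
    if norm == "frobenius":
        # Euclidean ball: normalize the gradient.
        return grad / np.linalg.norm(grad)
    if norm == "max":
        # Max-entry (l-infinity) ball: take entrywise signs.
        return np.sign(grad)
    raise ValueError(f"unknown norm: {norm}")


def trust_region_step(weights, grad, radius, norm):
    """Minimize <grad, X - weights> subject to ||X - weights|| <= radius."""
    return weights - radius * steepest_direction(grad, norm)


rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 3))
grad = rng.standard_normal((4, 3))
radius = 0.1

# Size of each step measured in its own norm, and the linear decrease achieved.
step_sizes = {}
decreases = {}
for name, measure in [
    ("spectral", lambda m: np.linalg.norm(m, 2)),   # largest singular value
    ("frobenius", lambda m: np.linalg.norm(m)),     # Euclidean norm of entries
    ("max", lambda m: np.max(np.abs(m))),           # largest absolute entry
]:
    step = trust_region_step(weights, grad, radius, name) - weights
    step_sizes[name] = measure(step)
    decreases[name] = np.sum(grad * step)

# Dual norms of the gradient: nuclear, Frobenius, and entrywise l1.
dual_norms = {
    "spectral": np.linalg.norm(grad, "nuc"),
    "frobenius": np.linalg.norm(grad),
    "max": np.sum(np.abs(grad)),
}
```

In each case the step lies on the boundary of the trust region, and the decrease of the linear model equals the radius times the dual norm of the gradient, which is why the choice of norm alone determines the update rule.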

@article{kovalev2025_2503.12645,
  title={Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization},
  author={Dmitry Kovalev},
  journal={arXiv preprint arXiv:2503.12645},
  year={2025}
}