Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Denoising diffusion models exhibit remarkable generative capabilities, but they remain challenging to train due to their inherent stochasticity: high-variance gradient estimates lead to slow convergence. Prior work has shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. To this end, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that applies learned rotations in place of the traditional scaling and shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by 12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN while requiring 5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release our implementation.
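
To make the two ideas concrete, below is a minimal PyTorch sketch, not the authors' released code: a magnitude-preserving linear layer that renormalizes its weight rows on every forward pass, and a rotation-modulation block that conditions token features by rotating channel pairs through angles predicted from the conditioning embedding. The pairwise-rotation parameterization and the `to_angles` projection are illustrative assumptions; the paper's exact layer definitions may differ.

```python
# Minimal sketch under assumed details; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPLinear(nn.Module):
    """Magnitude-preserving linear layer (assumed form): weight rows are
    renormalized to unit L2 norm on every forward pass, so the layer neither
    grows nor shrinks activation magnitudes for roughly uncorrelated,
    unit-variance inputs -- no normalization layer needed."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=1)  # unit-norm rows
        return F.linear(x, w)


class RotationModulation(nn.Module):
    """Conditioning by rotation (assumed form): channels are grouped into
    pairs and each pair is rotated by an angle predicted from the conditioning
    embedding, in place of AdaLN's learned shift/scale."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        assert dim % 2 == 0, "channel dimension must be even to form pairs"
        self.to_angles = nn.Linear(cond_dim, dim // 2)  # one angle per channel pair

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        theta = self.to_angles(cond)[:, None, :]        # (batch, 1, dim/2)
        x1, x2 = x[..., 0::2], x[..., 1::2]             # split channels into pairs
        cos, sin = torch.cos(theta), torch.sin(theta)
        y1 = cos * x1 - sin * x2                        # 2-D rotation of each pair
        y2 = sin * x1 + cos * x2
        return torch.stack((y1, y2), dim=-1).flatten(-2)  # interleave pairs back


x = torch.randn(2, 16, 64)      # (batch, tokens, channels)
cond = torch.randn(2, 128)      # e.g. timestep + class embedding
y = RotationModulation(64, 128)(MPLinear(64, 64)(x), cond)
print(y.shape)                  # torch.Size([2, 16, 64])
```

Because rotations are orthogonal, the modulation itself cannot change activation magnitudes, which is what makes it a natural companion to a magnitude-preserving design; a scale term can be added alongside it, as in the variant the abstract compares against AdaLN.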
View on arXiv: https://arxiv.org/abs/2505.19122

@article{bill2025_2505.19122,
  title   = {Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers},
  author  = {Eric Tillman Bill and Cristian Perez Jensen and Sotiris Anagnostidis and Dimitri von Rütte},
  journal = {arXiv preprint arXiv:2505.19122},
  year    = {2025}
}