Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

Denoising diffusion models exhibit remarkable generative capabilities, but they remain challenging to train due to their inherent stochasticity: high-variance gradient estimates lead to slow convergence. Prior work has shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. To this end, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that applies learned rotations in place of the traditional scaling and shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by 12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN while requiring 5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release our implementation.
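
To make the two ideas concrete, below is a minimal PyTorch sketch, not the authors' released code: a magnitude-preserving linear layer that renormalizes its weight rows on every forward pass, and a rotation-modulation block that conditions token features by rotating channel pairs through angles predicted from the conditioning embedding. The pairwise-rotation parameterization and the `to_angles` projection are illustrative assumptions; the paper's exact layer definitions may differ.

```python
# Minimal sketch under assumed details; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPLinear(nn.Module):
    """Magnitude-preserving linear layer (assumed form): weight rows are
    renormalized to unit L2 norm on every forward pass, so the layer neither
    grows nor shrinks activation magnitudes for roughly uncorrelated,
    unit-variance inputs -- no normalization layer needed."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=1)  # unit-norm rows
        return F.linear(x, w)


class RotationModulation(nn.Module):
    """Conditioning by rotation (assumed form): channels are grouped into
    pairs and each pair is rotated by an angle predicted from the conditioning
    embedding, in place of AdaLN's learned shift/scale."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        assert dim % 2 == 0, "channel dimension must be even to form pairs"
        self.to_angles = nn.Linear(cond_dim, dim // 2)  # one angle per channel pair

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        theta = self.to_angles(cond)[:, None, :]        # (batch, 1, dim/2)
        x1, x2 = x[..., 0::2], x[..., 1::2]             # split channels into pairs
        cos, sin = torch.cos(theta), torch.sin(theta)
        y1 = cos * x1 - sin * x2                        # 2-D rotation of each pair
        y2 = sin * x1 + cos * x2
        return torch.stack((y1, y2), dim=-1).flatten(-2)  # interleave pairs back


x = torch.randn(2, 16, 64)      # (batch, tokens, channels)
cond = torch.randn(2, 128)      # e.g. timestep + class embedding
y = RotationModulation(64, 128)(MPLinear(64, 64)(x), cond)
print(y.shape)                  # torch.Size([2, 16, 64])
```

Because rotations are orthogonal, the modulation itself cannot change activation magnitudes, which is what makes it a natural companion to a magnitude-preserving design; a scale term can be added alongside it, as in the variant the abstract compares against AdaLN.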
View on arXiv: https://arxiv.org/abs/2505.19122

@article{bill2025_2505.19122,
  title   = {Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers},
  author  = {Eric Tillman Bill and Cristian Perez Jensen and Sotiris Anagnostidis and Dimitri von Rütte},
  journal = {arXiv preprint arXiv:2505.19122},
  year    = {2025}
}