Swin DiT: Diffusion Transformer using Pseudo Shifted Windows

Abstract

Diffusion Transformers (DiTs) achieve remarkable performance in image generation by incorporating the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic transformer blocks that model global information, which incurs significant computational cost when processing high-resolution images. Our empirical analysis shows that latent-space image generation does not depend on global information as strongly as traditionally assumed: most layers in the model are redundant in their global computation. In addition, conventional attention mechanisms exhibit a low-frequency inertia issue. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global-modeling redundancy. PSWA achieves intermediate global-local information interaction through window attention, while employing a high-frequency bridging branch to simulate shifted-window operations and supplement appropriate global and high-frequency information. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy, which captures high-order attention similarity without additional computational cost. Building upon these components, we propose a series of Pseudo Shifted Window DiTs (Swin DiT), accompanied by extensive experiments demonstrating their superior performance. For example, our proposed Swin-DiT-L achieves a 54% FID improvement over DiT-XL/2 while requiring less computational cost (this https URL).
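
The abstract describes PSWA only at a high level, so the following is a minimal PyTorch sketch of the stated idea rather than the paper's implementation: attention within non-overlapping windows, plus a parallel high-frequency branch that bridges window boundaries in place of an explicit shifted-window pass. The depthwise-conv-minus-local-average high-pass design and all names (PseudoShiftedWindowAttention, bridge) are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoShiftedWindowAttention(nn.Module):
    """Sketch of window attention plus a high-frequency bridging branch."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical high-frequency bridge: a depthwise conv over the full
        # feature map, so its receptive field crosses window borders.
        self.bridge = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), with H and W divisible by the window size.
        B, H, W, C = x.shape
        w = self.window
        # Partition into non-overlapping windows and attend within each one.
        xw = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, C)
        out, _ = self.attn(xw, xw, xw)
        out = out.view(B, H // w, W // w, w, w, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        # High-frequency branch: depthwise response minus a local average is
        # a crude high-pass filter; it mixes features across window borders
        # without a second, shifted attention pass.
        xc = x.permute(0, 3, 1, 2)                       # (B, C, H, W)
        high = self.bridge(xc) - F.avg_pool2d(xc, 3, stride=1, padding=1)
        return self.proj(out + high.permute(0, 2, 3, 1))

x = torch.randn(2, 8, 8, 64)                             # toy latent map
y = PseudoShiftedWindowAttention(64, num_heads=8, window=4)(x)
print(y.shape)                                           # torch.Size([2, 8, 8, 64])

Because the convolutional branch spans window borders, cross-window interaction is obtained without an explicit shift, which is one plausible reading of the "pseudo shifted" naming.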

@article{wu2025_2505.13219,
  title={Swin DiT: Diffusion Transformer using Pseudo Shifted Windows},
  author={Jiafu Wu and Yabiao Wang and Jian Li and Jinlong Peng and Yun Cao and Chengjie Wang and Jiangning Zhang},
  journal={arXiv preprint arXiv:2505.13219},
  year={2025}
}