Taming Transformer Without Using Learning Rate Warmup

28 May 2025
Xianbiao Qi
Yelin He
Jiaquan Ye
Chun-Guang Li
Bojia Zi
Xili Dai
Qin Zou
Rong Xiao
Abstract

Scaling Transformers to a large scale without technical tricks such as learning rate warmup and a noticeably lower learning rate is extremely challenging and is attracting increasing attention. In this paper, we provide a theoretical analysis of the Transformer training process and reveal the rationale behind the model-crash phenomenon during training, termed spectral energy concentration of $\mathbf{W}_q^{\top}\mathbf{W}_k$, which is the cause of a malignant entropy collapse, where $\mathbf{W}_q$ and $\mathbf{W}_k$ are the projection matrices for the query and the key in the Transformer, respectively. To remedy this problem, motivated by Weyl's inequality, we present a novel optimization strategy, i.e., making the weight update smooth across successive steps: if the ratio $\frac{\sigma_1(\nabla\mathbf{W}_t)}{\sigma_1(\mathbf{W}_{t-1})}$ exceeds a threshold, we automatically bound the learning rate to a weighted multiple of $\frac{\sigma_1(\mathbf{W}_{t-1})}{\sigma_1(\nabla\mathbf{W}_t)}$, where $\nabla\mathbf{W}_t$ is the update quantity at step $t$. Such an optimization strategy prevents the spectral energy from concentrating in only a few directions, and thus avoids the malignant entropy collapse that triggers model crash. We conduct extensive experiments with ViT, Swin Transformer, and GPT, showing that our optimization strategy can effectively and stably train these Transformers without learning rate warmup.
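The update rule sketched in the abstract can be illustrated compactly. The following is a minimal sketch, not the authors' reference implementation: it measures spectral norms with PyTorch's torch.linalg.matrix_norm (ord=2), and the hyperparameter names tau (the ratio threshold) and gamma (the weight on the bound) are hypothetical placeholders for values the abstract does not specify.

import torch

def bounded_update(W_prev: torch.Tensor, grad_update: torch.Tensor,
                   lr: float, tau: float = 1.0, gamma: float = 1.0) -> torch.Tensor:
    """Apply one update step while bounding the learning rate by the
    ratio of spectral norms, as described in the abstract.

    tau and gamma are hypothetical hyperparameters (threshold and weight).
    """
    # Largest singular values of the previous weights and of the raw update.
    sigma_w = torch.linalg.matrix_norm(W_prev, ord=2)
    sigma_g = torch.linalg.matrix_norm(grad_update, ord=2)

    effective_lr = lr
    # If sigma_1(update) is too large relative to sigma_1(weights), cap the
    # learning rate by a weighted multiple of sigma_w / sigma_g. By Weyl's
    # inequality, sigma_1(W + dW) <= sigma_1(W) + sigma_1(dW), so this keeps
    # the top singular value of the new weights from growing abruptly.
    if sigma_g / sigma_w > tau:
        effective_lr = min(lr, gamma * (sigma_w / sigma_g).item())

    return W_prev - effective_lr * grad_update

In a real training loop, a check of this kind would be applied per weight matrix (for instance, the query and key projections) before the optimizer step is taken.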

@article{qi2025_2505.21910,
  title={Taming Transformer Without Using Learning Rate Warmup},
  author={Xianbiao Qi and Yelin He and Jiaquan Ye and Chun-Guang Li and Bojia Zi and Xili Dai and Qin Zou and Rong Xiao},
  journal={arXiv preprint arXiv:2505.21910},
  year={2025}
}