On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

17 May 2025
Huan Li
Yiming Dong
Zhouchen Lin
Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ denotes the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is generated from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both support that our convergence rate can be considered analogous to the optimal convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{K^{1/4}})$ of SGD.
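
For intuition behind the Gaussian bound stated above, here is a short derivation sketch (assuming i.i.d. $\mathcal N(0,1)$ gradient entries, as in the abstract; this is an illustrative argument, not reproduced from the paper's proof):

$E\left[\|g\|_1\right]=\sum_{i=1}^d E\left[|g_i|\right]=\sqrt{\frac{2}{\pi}}\,d$, since $E\left[|g_i|\right]=\sqrt{2/\pi}$ for $g_i\sim\mathcal N(0,1)$;

$E\left[\|g\|_2\right]\leq\sqrt{E\left[\|g\|_2^2\right]}=\sqrt{d}$ by Jensen's inequality;

hence $E\left[\|g\|_1\right]=\sqrt{\frac{2}{\pi}}\,d=\sqrt{\frac{2d}{\pi}}\cdot\sqrt{d}\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|g\|_2\right]$.

This is why the extra $\sqrt{d}$ factor in the $\ell_1$ rate is regarded as the natural analogue of the dimension-free $\ell_2$ rate of SGD.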

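As a quick sanity check of the $\varTheta(\sqrt{d})$ relation, the following minimal Python sketch draws synthetic standard Gaussian vectors and compares $\|g\|_1/\|g\|_2$ with $\sqrt{2d/\pi}$; it is a toy simulation for illustration only, not the paper's real-world deep learning experiments.

# Toy check: for g ~ N(0, I_d), the ratio ||g||_1 / ||g||_2 concentrates
# around sqrt(2d/pi), i.e. ||g||_1 = Theta(sqrt(d)) * ||g||_2.
import numpy as np

rng = np.random.default_rng(0)
for d in (10**2, 10**4, 10**6):
    g = rng.standard_normal(d)                       # stand-in for a gradient vector
    ratio = np.linalg.norm(g, 1) / np.linalg.norm(g, 2)
    print(f"d={d}: ratio={ratio:.1f}, sqrt(2d/pi)={np.sqrt(2 * d / np.pi):.1f}")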
@article{li2025_2505.11840,
  title={On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm},
  author={Huan Li and Yiming Dong and Zhouchen Lin},
  journal={arXiv preprint arXiv:2505.11840},
  year={2025}
}