Self-Adjust Softmax

25 February 2025
Chuanyang Zheng
Yihang Gao
Guoxuan Chen
Han Shi
Jing Xiong
Xiaozhe Ren
Chao Huang
Xin Jiang
Zhenguo Li
Yu Li
Abstract

The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one, achieving superior performance over alternative functions. However, softmax can suffer from vanishing gradients when some attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by replacing $\mathrm{softmax}(x)$ with $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, span diverse datasets, language tasks, and positional encoding methods.
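A minimal sketch of the two variants described in the abstract, assuming row-wise application to a tensor of attention scores in PyTorch; the function names sa_softmax and sa_softmax_normalized and the small epsilon guard in the denominator are illustrative choices, not taken from the paper:

import torch

def sa_softmax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Plain SA-Softmax: x * softmax(x) along the given dimension.
    return scores * torch.softmax(scores, dim=dim)

def sa_softmax_normalized(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Normalized SA-Softmax:
    # (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)
    x_min = scores.amin(dim=dim, keepdim=True).clamp(max=0.0)  # min(x_min, 0)
    x_max = scores.amax(dim=dim, keepdim=True).clamp(min=0.0)  # max(0, x_max)
    denom = (x_max - x_min).clamp(min=1e-6)  # guard against division by zero (assumption)
    return (scores - x_min) / denom * torch.softmax(scores, dim=dim)

In an attention layer, either function would be applied where softmax(scores) normally appears, i.e. to the scaled query-key dot products before multiplying by the values.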

View on arXiv
@article{zheng2025_2502.18277,
  title={Self-Adjust Softmax},
  author={Chuanyang Zheng and Yihang Gao and Guoxuan Chen and Han Shi and Jing Xiong and Xiaozhe Ren and Chao Huang and Xin Jiang and Zhenguo Li and Yu Li},
  journal={arXiv preprint arXiv:2502.18277},
  year={2025}
}