SUS backprop: linear backpropagation algorithm for long inputs in transformers

21 May 2025
Sergey Pankov
Georges Harik
Main: 12 pages, 9 figures; Bibliography: 5 pages; Appendix: 4 pages
Abstract

It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can, in some situations, save a significant amount of backpropagation computation in exchange for a minimal increase in the stochastic gradient variance. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length n. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule controlled by a single parameter c that cuts backpropagation through most attention weights, leaving at most c interactions per token per attention head. This brings a factor of c/n reduction in the compute required for the attention backpropagation, turning it from quadratic O(n^2) to linear complexity O(nc). We have empirically verified that, for a typical transformer model, cutting about 99% of the attention gradient flow (i.e., choosing c ~ 25-30) results in a relative gradient variance increase of only about 1% for n ~ 2000, and it decreases with n. This approach is amenable to an efficient sparse matrix implementation, thus being promising for making the cost of a backward pass negligible relative to the cost of a forward pass when training a transformer model on long sequences.
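
The abstract describes the idea only at a high level, so the following is a minimal PyTorch sketch of that general mechanism, not the paper's exact algorithm: an operation that is the identity in the forward pass but, in the backward pass, keeps each attention weight's gradient with some probability and rescales the survivors by the inverse probability, which leaves the gradient estimator unbiased. The specific keep-probability rule (c times the attention weight, capped at 1), the function names, and the dense-tensor implementation are illustrative assumptions; the paper's actual probabilistic rule and its sparse-matrix implementation may differ.

import torch


class StochasticGradCut(torch.autograd.Function):
    """Identity in the forward pass. In the backward pass, each element's
    gradient is kept with probability p and rescaled by 1/p, so the
    expected gradient is unchanged (unbiased) while most entries are zeroed."""

    @staticmethod
    def forward(ctx, attn, keep_prob):
        ctx.save_for_backward(keep_prob)
        return attn

    @staticmethod
    def backward(ctx, grad_out):
        (keep_prob,) = ctx.saved_tensors
        mask = torch.bernoulli(keep_prob)                      # 0/1 per attention weight
        grad_in = grad_out * mask / keep_prob.clamp(min=1e-12)  # rescale kept entries
        return grad_in, None


def cut_attention_grads(attn, c):
    """Keep, on average, at most c gradient-carrying weights per query row.
    attn: (..., n_queries, n_keys) post-softmax attention weights.
    The rule keep_prob = min(1, c * attn) is a hypothetical illustration."""
    keep_prob = (c * attn.detach()).clamp(max=1.0)
    return StochasticGradCut.apply(attn, keep_prob)


# Toy usage inside a naive single-head attention computation.
n, d, c = 2000, 64, 25
q, k, v = (torch.randn(n, d, requires_grad=True) for _ in range(3))
attn = torch.softmax(q @ k.T / d**0.5, dim=-1)   # (n, n) attention weights
attn = cut_attention_grads(attn, c)              # gradient flow cut stochastically
out = attn @ v
out.sum().backward()                             # unbiased, mostly sparse attention grads

Because the rows of a softmax sum to 1, capping the keep probability at min(1, c * attn) keeps about c weights per query in expectation, which is where the c/n reduction comes from: with c = 25 and n = 2000, only about 1.25% of the attention gradients survive, consistent with the abstract's figure of cutting roughly 99% of the attention gradient flow. This sketch stays dense, so it demonstrates the unbiasedness of the estimator but not the compute savings, which require the sparse implementation the abstract points to.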
@article{pankov2025_2505.15080,
  title={SUS backprop: linear backpropagation algorithm for long inputs in transformers},
  author={Sergey Pankov and Georges Harik},
  journal={arXiv preprint arXiv:2505.15080},
  year={2025}
}