
Accelerating AllReduce with a Persistent Straggler

Main: 10 pages, 12 figures, 1 table; Bibliography: 6 pages; Appendix: 7 pages
Abstract

Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
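The sketch below illustrates only the high-level idea described in the abstract: the non-straggler GPUs perform a ReduceScatter while the straggler is still busy, and the AllReduce is completed once the straggler arrives. It is not StragglAR itself; the completion step shown here is a deliberately naive point-to-point placeholder for the paper's novel collective, and the function name, rank layout, and even-divisibility assumption are illustrative choices, not taken from the paper.

```python
# Conceptual sketch only (not the paper's StragglAR algorithm).
import torch
import torch.distributed as dist

def straggler_aware_allreduce(tensor: torch.Tensor, straggler: int) -> torch.Tensor:
    rank = dist.get_rank()
    world = dist.get_world_size()
    fast = [r for r in range(world) if r != straggler]
    fast_group = dist.new_group(ranks=fast)  # collective: every rank must call this

    n = len(fast)
    # Sketch assumption: the leading dimension splits evenly across the fast ranks.
    assert tensor.size(0) % n == 0, "sketch assumes an evenly divisible tensor"
    shards = [c.contiguous() for c in tensor.chunk(n)]

    if rank != straggler:
        # Phase 1: ReduceScatter among the fast ranks, overlapped with the
        # straggler's remaining local work. Fast rank at group position idx
        # ends up owning shard idx, reduced over the fast ranks only.
        idx = fast.index(rank)
        my_shard = torch.empty_like(shards[0])
        dist.reduce_scatter(my_shard, shards, group=fast_group)

        # Phase 2 (naive completion, for illustration only): fold in the
        # straggler's contribution, AllGather within the fast group, and
        # forward the finished result to the straggler.
        straggler_shard = torch.empty_like(my_shard)
        dist.recv(straggler_shard, src=straggler)
        my_shard += straggler_shard

        gathered = [torch.empty_like(my_shard) for _ in range(n)]
        dist.all_gather(gathered, my_shard, group=fast_group)
        result = torch.cat(gathered)

        if idx == 0:
            dist.send(result, dst=straggler)
        return result
    else:
        # Straggler: after finishing its local work, send each shard to the
        # fast rank that owns it, then receive the completed AllReduce.
        for i, r in enumerate(fast):
            dist.send(shards[i], dst=r)
        result = torch.empty_like(tensor)
        dist.recv(result, src=fast[0])
        return result
```

The point of the sketch is the overlap in Phase 1: the ReduceScatter among the fast ranks costs the straggler nothing, so only the completion step remains on the critical path once the straggler reaches the barrier. The paper's contribution is a bandwidth-efficient collective for that completion step, which this naive version does not capture.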

@article{devraj2025_2505.23523,
  title={Accelerating AllReduce with a Persistent Straggler},
  author={Arjun Devraj and Eric Ding and Abhishek Vijaya Kumar and Robert Kleinberg and Rachee Singh},
  journal={arXiv preprint arXiv:2505.23523},
  year={2025}
}