A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks

14 January 2019
S. Shi
Qiang-qiang Wang
Kaiyong Zhao
Zhenheng Tang
Yuxin Wang
Xiang Huang
Xiaowen Chu
arXiv:1901.04359
Abstract

Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data exchanged among workers. Top-k sparsification can zero out a significant portion of gradients without impacting model convergence. However, the sparse gradients must be transferred together with their irregular indices, which makes sparse gradient aggregation difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of O(kP), where P is the number of workers, which is inefficient on low-bandwidth networks with a large number of workers. We observe that not all top-k gradients from the P workers are needed for the model update, and therefore we propose a novel global Top-k (gTop-k) sparsification mechanism to address this problem. Specifically, in each iteration we select the k gradients with the globally largest absolute values across the P workers, instead of accumulating all local top-k gradients, to update the model. The gradient aggregation method based on gTop-k sparsification reduces the communication complexity from O(kP) to O(k log P). Through extensive experiments on different DNNs, we verify that gTop-k S-SGD has nearly the same convergence behavior as S-SGD, with only slight degradation in generalization performance. In terms of scaling efficiency, we evaluate gTop-k on a cluster of 32 GPU machines interconnected with 1 Gbps Ethernet. The experimental results show that our method achieves 2.7-12× higher scaling efficiency than S-SGD and a 1.1-1.7× improvement over the existing Top-k S-SGD.
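
The central idea of the abstract, selecting one global top-k set across P workers via a tree-structured reduction instead of gathering all P local top-k sets, can be illustrated with a short sketch. The following is a minimal single-process NumPy simulation under stated assumptions: the names local_top_k, merge_top_k, and gtop_k_aggregate are illustrative, not the authors' implementation, and in a real deployment each merge step would be a pairwise exchange between workers (e.g., over MPI), so each worker sends only k (index, value) pairs in each of the log2(P) rounds.

```python
# Minimal sketch of gTop-k sparsification (single-process simulation of P workers).
# Assumption: function names and the in-memory "workers" are illustrative only.
import numpy as np

def local_top_k(grad, k):
    # Keep the k largest-magnitude entries of a dense gradient as (indices, values).
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def merge_top_k(a, b, k):
    # Merge two sparse (indices, values) sets, summing values at duplicate indices,
    # then keep only the k entries with the largest magnitudes.
    idx = np.concatenate([a[0], b[0]])
    val = np.concatenate([a[1], b[1]])
    uniq, inv = np.unique(idx, return_inverse=True)
    summed = np.zeros(len(uniq))
    np.add.at(summed, inv, val)
    if len(uniq) <= k:
        return uniq, summed
    keep = np.argpartition(np.abs(summed), -k)[-k:]
    return uniq[keep], summed[keep]

def gtop_k_aggregate(worker_grads, k):
    # Tree reduction over P workers: log2(P) rounds, each exchanging only
    # k (index, value) pairs per participating worker, i.e. O(k log P) communication.
    sparse = [local_top_k(g, k) for g in worker_grads]
    P, step = len(worker_grads), 1
    while step < P:
        for i in range(0, P, 2 * step):
            if i + step < P:
                sparse[i] = merge_top_k(sparse[i], sparse[i + step], k)
        step *= 2
    return sparse[0]  # global top-k (indices, values), broadcast back to all workers

# Example: 4 workers, 1000-dimensional gradients, k = 10.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(1000) for _ in range(4)]
idx, val = gtop_k_aggregate(grads, k=10)
print(idx.shape, val.shape)  # -> (10,) (10,)
```

Because each round keeps every message at k entries while halving the number of active senders, the per-worker communication volume grows with k·log2(P), in contrast to the k·P volume of an AllGather over all local top-k sets.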
