Blink: Fast and Generic Collectives for Distributed ML

11 October 2019
Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, J. Thelin, Nikhil R. Devanur, Ion Stoica

Papers citing "Blink: Fast and Generic Collectives for Distributed ML"

14 papers shown

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices
Si Ung Noh, Junguk Hong, Chaemin Lim, Seong-Yeol Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee
13 Apr 2024

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu
09 Apr 2024

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
Jaeyong Song, Jinkyu Yim, Jaewon Jung, Hongsun Jang, H. Kim, Youngsok Kim, Jinho Lee
24 Jan 2023

Efficient All-reduce for Distributed DNN Training in Optical Interconnect System
Fei Dai, Yawen Chen, Zhiyi Huang, Haibo Zhang, Fangfang Zhang
22 Jul 2022

Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, T. Krishna
22 Jul 2022

MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, Xin Jin
30 Apr 2022

Efficient Direct-Connect Topologies for Collective Communications
Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, P. Basu, J. Khoury, Arvind Krishnamurthy
07 Feb 2022

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, M. Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, A. Kewitsch
01 Feb 2022

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models
Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, T. Krishna
09 Oct 2021

Scalable and accurate multi-GPU based image reconstruction of large-scale ptychography data
Xiaodong Yu, Viktor V. Nikitin, Daniel J. Ching, Selin S. Aslan, D. Gursoy, Tekin Bicer
14 Jun 2021

Synthesizing Optimal Collective Algorithms
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi
19 Aug 2020

Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
Siyuan Zhuang, Zhuohan Li, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
13 Feb 2020

Pipelined Training with Stale Weights of Deep Convolutional Neural Networks
Lifu Zhang, T. Abdelrahman
29 Dec 2019

Taming Momentum in a Distributed Asynchronous Environment
Ido Hakimi, Saar Barkai, Moshe Gabel, Assaf Schuster
26 Jul 2019