ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs

1 February 2022
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, M. Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, A. Kewitsch

Papers citing "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs"

23 / 23 papers shown
Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
Jinsun Yoo, ChonLam Lao, Lianjie Cao, Bob Lantz, Minlan Yu, Tushar Krishna, Puneet Sharma
29 Apr 2025

Routing for Large ML Models
Ofir Cohen, Jose Yallouz, Michael Schapira, Shahar Belkar, Tal Mizrahi
07 Mar 2025

mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training
Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, ..., Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen
Topics: MoE
08 Jan 2025

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
02 Sep 2024

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun
29 Jul 2024

VcLLM: Video Codecs are Secretly Tensor Codecs
Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills
29 Jun 2024

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu
Topics: GNN
09 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
Topics: MoE
03 Apr 2024

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui
Topics: AI4CE
12 Mar 2024

MLTCP: Congestion Control for DNN Training
S. Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, M. Ghobadi
14 Feb 2024

ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics
Liangyu Zhao, Saeed Maleki, Ziyue Yang, Hossein Pourreza, Aashaka Shah
09 Feb 2024

Swing: Short-cutting Rings for Higher Bandwidth Allreduce
Daniele De Sensi, Tommaso Bonato, D. Saam, Torsten Hoefler
17 Jan 2024

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan
06 Dec 2023

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zach DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
Topics: MoE
04 Oct 2023

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies
P. Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, J. Khoury
24 Sep 2023

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
S. Rajasekaran, M. Ghobadi, Aditya Akella
Topics: GNN
01 Aug 2023

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
William Won, Suvinay Subramanian, Sudarshan Srinivasan, A. Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
11 Apr 2023

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
N. Jouppi, George Kurian, Sheng Li, Peter C. Ma, R. Nagarajan, ..., Brian Towles, C. Young, Xiaoping Zhou, Zongwei Zhou, David A. Patterson
Topics: BDL, VLM
04 Apr 2023

THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
Minghao Li, Ran Ben-Basat, S. Vargaftik, Chon-In Lao, Ke Xu, Michael Mitzenmacher, Minlan Yu (Harvard University)
16 Feb 2023

Efficient Direct-Connect Topologies for Collective Communications
Liangyu Zhao, Siddharth Pal, Tapan Chugh, Weiyang Wang, Jason Fantl, P. Basu, J. Khoury, Arvind Krishnamurthy
07 Feb 2022

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
William Won, Saeed Rashidi, Sudarshan Srinivasan, T. Krishna
Topics: AI4CE
24 Sep 2021

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems
Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, ..., Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, M. Smelyanskiy
Topics: GNN
20 Mar 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
Topics: MoE
17 Sep 2019