Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

9 October 2021
Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, T. Krishna
Topics: GNN

Papers citing "Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models"

17 papers shown.
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou
12 Apr 2025

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models
Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna
28 Jun 2024

PALM: A Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training
Jiahao Fang, Huizheng Wang, Qize Yang, Dehao Kong, Xu Dai, Jinyi Deng, Yang Hu, Shouyi Yin
06 Jun 2024

Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
Banruo Liu, M. Ojewale, Yuhan Ding, Marco Canini
05 May 2024

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices
Si Ung Noh, Junguk Hong, Chaemin Lim, Seong-Yeol Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee
13 Apr 2024

GPU Cluster Scheduling for Network-Sensitive Deep Learning
Aakash Sharma, Vivek M. Bhasi, Sonali Singh, G. Kesidis, M. Kandemir, Chita R. Das
29 Jan 2024

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training
Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, Minsoo Rhu
27 Nov 2023

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zach DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
Topics: MoE
04 Oct 2023

Isolated Scheduling for Distributed Training Tasks in GPU Clusters
Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Sheng-Yuan Lin, Shi-Ming Zhao
10 Aug 2023

Optimizing Distributed ML Communication with Fused Computation-Collective Operations
Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann
Topics: FedML
11 May 2023

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
William Won, Suvinay Subramanian, Sudarshan Srinivasan, A. Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
11 Apr 2023

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, T. Krishna
24 Mar 2023

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
D. Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Bambhaniya, T. Krishna, Alexandros Daglis
Topics: VLM
30 Nov 2022

Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, T. Krishna
Topics: OOD
22 Jul 2022

EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
Lois Orosa, Skanda Koppula, Yaman Umuroglu, Konstantinos Kanellopoulos, Juan Gómez Luna, Michaela Blott, K. Vissers, O. Mutlu
04 Feb 2022

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
William Won, Saeed Rashidi, Sudarshan Srinivasan, T. Krishna
Topics: AI4CE
24 Sep 2021

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, ..., Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean
Topics: AIMat
26 Sep 2016