1901.05758
Cited By
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
17 January 2019
Myeongjae Jeon
Shivaram Venkataraman
Amar Phanishayee
Junjie Qian
Wencong Xiao
Fan Yang
GNN
Papers citing
"Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads"
32 / 32 papers shown
Title
LithOS: An Operating System for Efficient Machine Learning on GPUs
Patrick H. Coppock
Brian Zhang
Eliot H. Solomon
Vasilis Kypriotis
Leon Yang
Bikash Sharma
Dan Schatzberg
Todd C. Mowry
Dimitrios Skarlatos
40
0
0
21 Apr 2025
Orthogonal Calibration for Asynchronous Federated Learning
Jiayun Zhang
Shuheng Li
Haiyu Huang
Xiaofan Yu
Rajesh K. Gupta
Jingbo Shang
FedML
65
0
0
21 Feb 2025
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
Shri Kiran Srinivasan
Kalyan Saladi
Carole-Jean Wu
193
7
0
29 Oct 2024
A Survey on Failure Analysis and Fault Injection in AI Systems
Guangba Yu
Gou Tan
Haojia Huang
Zhenyu Zhang
Pengfei Chen
Roberto Natella
Zibin Zheng
51
4
0
28 Jun 2024
PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
Zhuobin Huang
Xingda Wei
Yingyi Hao
Rong Chen
Mingcong Han
Jinyu Gu
Haibo Chen
34
4
0
20 May 2024
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang
Zhen Zhang
Haifeng Lu
Victor C. M. Leung
Yanyi Guo
Xiping Hu
GNN
39
6
0
09 Apr 2024
A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Chunyu Xue
Weihao Cui
Han Zhao
Quan Chen
Shulai Zhang
Peng Yang
Jing Yang
Shaobo Li
Minyi Guo
56
2
0
24 Mar 2024
Towards providing reliable job completion time predictions using PCS
Abdullah Bin Faisal
Noah Martin
Hafiz Mohsin Bashir
Swaminathan Lamelas
Fahad R. Dogar
22
0
0
18 Jan 2024
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Diandian Gu
Xintong Xie
Gang Huang
Xin Jin
Xuanzhe Liu
GNN
24
7
0
13 Apr 2023
Making AI Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of AI Models
Pengfei Li
Jianyi Yang
M. A. Islam
Shaolei Ren
91
123
0
06 Apr 2023
MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
Yihao Zhao
Xin Liu
Shufan Liu
Xiang Li
Yibo Zhu
Gang Huang
Xuanzhe Liu
Xin Jin
35
11
0
24 Mar 2023
Kernel-as-a-Service: A Serverless Interface to GPUs
Nathan Pemberton
Anton Zabreyko
Zhoujie Ding
R. Katz
Joseph E. Gonzalez
29
8
0
15 Dec 2022
RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems
Alessandro Ottino
Joshua L. Benjamin
G. Zervas
30
7
0
28 Nov 2022
An Analysis of Collocation on GPUs for Deep Learning Training
Ties Robroek
Ehsan Yousefzadeh-Asl-Miandoab
Pınar Tözün
22
9
0
13 Sep 2022
EasyScale: Accuracy-consistent Elastic Training for Deep Learning
Mingzhen Li
Wencong Xiao
Biao Sun
Hanyu Zhao
Hailong Yang
...
Xianyan Jia
Yi Liu
Yong Li
Wei Lin
D. Qian
22
7
0
30 Aug 2022
FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models
Yunzhuo Liu
Bo Jiang
Tian Guo
Zimeng Huang
Wen-ping Ma
Xinbing Wang
Chenghu Zhou
24
9
0
28 Apr 2022
The MIT Supercloud Workload Classification Challenge
Benny J. Tang
Qiqi Chen
Matthew L. Weiss
Nathan C. Frey
Joseph McDonald
...
Lindsey McEvoy
Baolin Li
Devesh Tiwari
V. Gadepally
S. Samsi
19
2
0
12 Apr 2022
Pathways: Asynchronous Distributed Dataflow for ML
P. Barham
Aakanksha Chowdhery
J. Dean
Sanjay Ghemawat
Steven Hand
...
Parker Schuh
Ryan Sepassi
Laurent El Shafey
C. A. Thekkath
Yonghui Wu
GNN
MoE
45
126
0
23 Mar 2022
SpotLake: Diverse Spot Instance Dataset Archive Service
Sungjae Lee
Jaeil Hwang
Kyungyong Lee
4
12
0
07 Feb 2022
Benchmarking Resource Usage for Efficient Distributed Deep Learning
Nathan C. Frey
Baolin Li
Joseph McDonald
Dan Zhao
Michael Jones
David Bestor
Devesh Tiwari
V. Gadepally
S. Samsi
35
9
0
28 Jan 2022
GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge
Arthi Padmanabhan
Neil Agarwal
Anand Iyer
Ganesh Ananthanarayanan
Yuanchao Shu
Nikolaos Karianakis
G. Xu
Ravi Netravali
43
59
0
19 Jan 2022
Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing
Yiding Wang
D. Sun
Kai Chen
Fan Lai
Mosharaf Chowdhury
33
44
0
17 Jan 2022
Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters
Chen Sun
Shenggui Li
Jinyue Wang
Jun Yu
54
47
0
08 Aug 2021
A Multi-Tenant Framework for Cloud Container Services
Chao Zheng
Qinghui Zhuang
Fei Guo
25
5
0
24 Mar 2021
CPU Scheduling in Data Centers Using Asynchronous Finite-Time Distributed Coordination Mechanisms
Andreas Grammenos
Themistoklis Charalambous
Evangelia Kalyvianaki
22
20
0
15 Jan 2021
CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
Kiwan Maeng
Shivam Bharuka
Isabel Gao
M. C. Jeffrey
V. Saraph
...
Caroline Trippel
Jiyan Yang
Michael G. Rabbat
Brandon Lucia
Carole-Jean Wu
OffRL
26
31
0
05 Nov 2020
Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster
Ying Mao
Yuqi Fu
Wenjia Zheng
Long Cheng
Qingzhi Liu
Dingwen Tao
21
29
0
21 Oct 2020
VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware
Andrew Or
Haoyu Zhang
M. Freedman
17
9
0
20 Sep 2020
Spatial Sharing of GPU for Autotuning DNN models
Aditya Dhakal
Junguk Cho
Sameer G. Kulkarni
K. Ramakrishnan
P. Sharma
19
8
0
08 Aug 2020
DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training
Weiyan Wang
Cengguang Zhang
Liu Yang
Kai Chen
Kun Tan
29
12
0
07 Jul 2020
Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers
Shijian Li
R. Walls
Tian Guo
31
23
0
07 Apr 2020
Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs
Qiang-qiang Wang
S. Shi
Canhui Wang
Xiaowen Chu
24
13
0
24 Feb 2020