Cited By

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
arXiv:2008.12260 · 27 August 2020
Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, G. Ganger, Eric Xing
Papers citing "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (22 of 22 papers shown)
Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, Mohsen Guizani, Qirong Ho
19 May 2025
Phantora: Live GPU Cluster Simulation for Machine Learning System Performance Estimation
Jianxing Qin, Jingrong Chen, Xinhao Kong, Yongji Wu, Liang Luo, Ziyi Wang, Ying Zhang, Tingjun Chen, Alvin R. Lebeck, Danyang Zhuo
02 May 2025
Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters
Ziyue Luo, Jia-Wei Liu, Myungjin Lee, Ness B. Shroff
09 Jan 2025
How to Rent GPUs on a Budget
Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
21 Jun 2024
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu
09 Apr 2024
A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Peng Yang, Jing Yang, Shaobo Li, Minyi Guo
24 Mar 2024
Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg
27 Feb 2024
Towards Providing Reliable Job Completion Time Predictions Using PCS
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar
18 Jan 2024
Accelerating Distributed ML Training via Selective Synchronization
S. Tyagi, Martin Swany
16 Jul 2023
GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training
S. Tyagi, Martin Swany
20 May 2023
Scheduling Multi-Server Jobs with Sublinear Regrets via Online Learning
Hailiang Zhao, Shuiguang Deng, Zhengzhe Xiang, Xueqiang Yan, Jianwei Yin, Schahram Dustdar, Albert Y. Zomaya
11 May 2023
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Diandian Gu, Xintong Xie, Gang Huang, Xin Jin, Xuanzhe Liu
13 Apr 2023
MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, Xin Jin
24 Mar 2023
Task Placement and Resource Allocation for Edge Machine Learning: A GNN-based Multi-Agent Reinforcement Learning Paradigm
Yihong Li, Xiaoxi Zhang, Tian Zeng, Jingpu Duan, Chuanxi Wu, Di Wu, Xu Chen
01 Feb 2023
EasyScale: Accuracy-consistent Elastic Training for Deep Learning
Mingzhen Li, Wencong Xiao, Biao Sun, Hanyu Zhao, Hailong Yang, ..., Xianyan Jia, Yi Liu, Yong Li, Wei Lin, D. Qian
30 Aug 2022
Learning to Schedule Multi-Server Jobs with Fluctuated Processing Speeds
Hailiang Zhao, Shuiguang Deng, Feiyi Chen, Jianwei Yin, Schahram Dustdar, Albert Y. Zomaya
09 Apr 2022
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
Weiyang Wang, Moein Khazraee, Zhizhen Zhong, M. Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, A. Kewitsch
01 Feb 2022
Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
Cheng Tan, Zhichao Li, Jian Zhang, Yunyin Cao, Sikai Qi, Zherui Liu, Yibo Zhu, Chuanxiong Guo
18 Sep 2021
On the Future of Cloud Engineering
David Bermbach, A. Chandra, C. Krintz, A. Gokhale, Aleksander Slominski, L. Thamsen, Everton Cavalcante, Tian Guo, Ivona Brandić, R. Wolski
19 Aug 2021
BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes
Zhengchun Liu, R. Kettimuthu, M. Papka, Ian Foster
22 Jun 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
15 Sep 2016