arXiv 2408.08586
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
16 August 2024
Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Yong Li, Wei Lin, Fangming Liu
Papers citing
"Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling"
3 / 3 papers shown
Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, M. Guizani, Qirong Ho
19 May 2025
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
15 Sep 2016