Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling

16 August 2024
Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Yong Li, Wei Lin, Fangming Liu

Papers citing "Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling" (3 of 3 papers shown)
Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge
Wenjiao Feng, Rongxing Xiao, Zonghang Li, Hongfang Yu, Gang Sun, Long Luo, M. Guizani, Qirong Ho
19 May 2025
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
15 Sep 2016