

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
arXiv:2309.08125

15 September 2023
Insu Jang
Zhenning Yang
Zhen Zhang
Xin Jin
Mosharaf Chowdhury
Topics: MoE, AI4CE, OODD

Papers citing "Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates"

24 / 24 papers shown

  1. Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training — Daiyaan Arfeen, Dheevatsa Mudigere, Ankit More, Bhargava Gopireddy, Ahmet Inci, G. R. Ganger (08 Apr 2025)
  2. Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training — Yijie Zheng, Bangjun Xiao, Lei Shi, Xiaoyang Li, Faming Wu, Tianyu Li, Xuefeng Xiao, Yuhang Zhang, Yixuan Wang, Shouda Liu [MLLM, MoE] (31 Mar 2025)
  3. Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack — Chenxi Dai, Lin Lu, Pan Zhou (22 Feb 2025)
  4. Orthogonal Calibration for Asynchronous Federated Learning — Jiayun Zhang, Shuheng Li, Haiyu Huang, Xiaofan Yu, Rajesh K. Gupta, Jingbo Shang [FedML] (21 Feb 2025)
  5. Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization — Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Y. Wang, Hailin Zhang, Xiaonan Nie, Bin Cui [MoMe] (17 Oct 2024)
  6. FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training — Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang (16 Oct 2024)
  7. Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach — Rory Young, Nicolas Pugeault [AAML] (14 Oct 2024)
  8. HybridFlow: A Flexible and Efficient RLHF Framework — Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Size Zheng, Haibin Lin, Chuan Wu [AI4CE] (28 Sep 2024)
  9. Efficient Training of Large Language Models on Distributed Infrastructures: A Survey — Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun (29 Jul 2024)
  10. Enabling Elastic Model Serving with MultiWorld — Myungjin Lee, Akshay Jajoo, Ramana Rao Kompella [MoE] (12 Jul 2024)
  11. Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement — Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, Danyang Zhuo (05 Jul 2024)
  12. Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey — Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu (12 Jun 2024)
  13. SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures — Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis [AI4CE, GNN] (22 May 2024)
  14. Toward Cross-Layer Energy Optimizations in Machine Learning Systems — Jae-Won Chung, Mosharaf Chowdhury (10 Apr 2024)
  15. Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey — Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu [GNN] (09 Apr 2024)
  16. Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances — Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia (21 Mar 2024)
  17. Characterization of Large Language Model Development in the Datacenter — Qi Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Mengdie Zhang, ..., Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang (12 Mar 2024)
  18. Unicron: Economizing Self-Healing LLM Training at Scale — Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou (30 Dec 2023)
  19. Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections — Marcel Wagenlander, Guo Li, Bo Zhao, Luo Mai, Peter R. Pietzuch (08 Dec 2023)
  20. Exploring the Robustness of Decentralized Training for Large Language Models — Lin Lu, Chenxi Dai, Wangcheng Tao, Binhang Yuan, Yanan Sun, Pan Zhou (01 Dec 2023)
  21. TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning — William Won, Suvinay Subramanian, Sudarshan Srinivasan, A. Durg, Samvit Kaul, Swati Gupta, Tushar Krishna (11 Apr 2023)
  22. Varuna: Scalable, Low-cost Training of Massive Deep Learning Models — Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, Nipun Kwatra [GNN] (07 Nov 2021)
  23. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines — Shigang Li, Torsten Hoefler [GNN, AI4CE, LRM] (14 Jul 2021)
  24. ZeRO-Offload: Democratizing Billion-Scale Model Training — Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyang Yang, Minjia Zhang, Dong Li, Yuxiong He [MoE] (18 Jan 2021)