Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.00134
Cited By
Unicron: Economizing Self-Healing LLM Training at Scale
30 December 2023
Tao He
Xue Li
Zhibin Wang
Kun Qian
Jingbo Xu
Wenyuan Yu
Jingren Zhou
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Unicron: Economizing Self-Healing LLM Training at Scale"
11 / 11 papers shown
Title
Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge
Wenjiao Feng
Rongxing Xiao
Zonghang Li
Hongfang Yu
Gang Sun
Long Luo
Mohsen Guizani
Qirong Ho
9
0
0
19 May 2025
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng
Xiang Shi
Zhuo Jiang
X. Zhang
Lei Zhang
...
Fuliang Li
Shuguang Wang
H. Lin
Jianxi Ye
Minlan Yu
LRM
163
2
0
04 Nov 2024
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
Shri Kiran Srinivasan
Kalyan Saladi
Carole-Jean Wu
172
7
0
29 Oct 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Peng Sun
73
8
0
29 Jul 2024
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
Yongji Wu
Wenjie Qu
Tianyang Tao
Zhuang Wang
Wei Bai
Zhuohao Li
Yuan Tian
Jiaheng Zhang
Matthew Lentz
Danyang Zhuo
69
3
0
05 Jul 2024
DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Avinash Maurya
Robert Underwood
M. Rafique
Franck Cappello
Bogdan Nicolae
21
14
0
15 Jun 2024
SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures
Swapnil Gandhi
Mark Zhao
Athinagoras Skiadopoulos
Christos Kozyrakis
AI4CE
GNN
49
1
0
22 May 2024
ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling
Lingxi Zhang
Yue Yu
Kuan-Chieh Jackson Wang
Chao Zhang
VLM
RALM
30
4
0
21 Feb 2024
TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
William Won
Suvinay Subramanian
Sudarshan Srinivasan
A. Durg
Samvit Kaul
Swati Gupta
Tushar Krishna
27
6
0
11 Apr 2023
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
Sanjith Athlur
Nitika Saran
Muthian Sivathanu
Ramachandran Ramjee
Nipun Kwatra
GNN
33
80
0
07 Nov 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
245
1,826
0
17 Sep 2019
1