Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.15627
Cited By
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
23 February 2024
Ziheng Jiang
Yanghua Peng
Yinmin Zhong
Qi Huang
Yangrui Chen
Zhi-Li Zhang
Size Zheng
Xiang Li
Cong Xie
Shibiao Nong
Yulu Jia
Sun He
Hongmin Chen
Zhihao Bai
Qi Hou
Shipeng Yan
Ding Zhou
Yiyao Sheng
Zhuo Jiang
Haohan Xu
Haoran Wei
Zhang Zhang
Pengfei Nie
Leqi Zou
Sida Zhao
Liang Xiang
Zherui Liu
Zhe Li
X. Jia
Jia-jun Ye
Xin Jin
Xin Liu
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs"
41 / 41 papers shown
Title
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Chenggang Zhao
Chengqi Deng
Chong Ruan
Damai Dai
Huazuo Gao
...
Wenfeng Liang
Ying He
Yishuo Wang
Yuxuan Liu
Y. X. Wei
MoE
41
0
0
14 May 2025
Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
Jinsun Yoo
ChonLam Lao
Lianjie Cao
Bob Lantz
Minlan Yu
Tushar Krishna
Puneet Sharma
52
0
0
29 Apr 2025
Accelerating Mixture-of-Experts Training with Adaptive Expert Replication
Athinagoras Skiadopoulos
Mark Zhao
Swapnil Gandhi
Thomas Norrie
Shrijeet Mukherjee
Christos Kozyrakis
MoE
91
0
0
28 Apr 2025
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
Yinmin Zhong
Zili Zhang
Xiaoniu Song
Hanpeng Hu
Chao Jin
...
Changyi Wan
Hongyu Zhou
Yimin Jiang
Yibo Zhu
Daxin Jiang
OffRL
AI4TS
57
0
0
22 Apr 2025
NNTile: a machine learning framework capable of training extremely large GPT language models on a single node
A. Mikhalev
Aleksandr Katrutsa
Konstantin Sozykin
Ivan Oseledets
35
0
0
17 Apr 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Wenjie Qu
Yansen Wang
Shouda Liu
MLLM
MoE
67
1
0
31 Mar 2025
Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
Yihong Jin
Ze Yang
Xinhe Xu
Yihan Zhang
Shuyang Ji
56
4
0
15 Mar 2025
Characterizing GPU Resilience and Impact on AI/HPC Systems
Shengkun Cui
Archit Patke
Ziheng Chen
Aditya Ranjan
Hung Nguyen
...
Chandra Narayanaswami
Daby M. Sow
C. Martino
Zbigniew T. Kalbarczyk
R. Iyer
39
0
0
14 Mar 2025
ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
Hao Ge
Junda Feng
Qi Huang
Fangcheng Fu
Xiaonan Nie
Lei Zuo
Yanghua Peng
Bin Cui
Xin Liu
47
2
0
28 Feb 2025
Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
Siddharth Singh
Prajwal Singhania
Aditya K. Ranjan
John Kirchenbauer
Jonas Geiping
...
Abhimanyu Hans
Manli Shu
Aditya Tomar
Tom Goldstein
A. Bhatele
102
2
0
12 Feb 2025
mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training
Xudong Liao
Yijun Sun
Han Tian
Xinchen Wan
Yilun Jin
...
Guyue Liu
Ying Zhang
Xiaofeng Ye
Yiming Zhang
Kai Chen
MoE
39
0
0
08 Jan 2025
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Lang Xu
Quentin G. Anthony
Jacob Hatef
Hari Subramoni
Hari Subramoni
Dhabaleswar K.
Panda
37
0
0
08 Jan 2025
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu
Jinyu Chen
Peirong Zheng
Xiaoquan Yi
Tianyi Tian
...
Quan Wan
Yining Qi
Yunfeng Fan
Qinliang Su
Xuemin Shen
AI4CE
119
1
0
18 Dec 2024
Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge
Yuhe Ji
Yilun Liu
Feiyu Yao
Minggui He
Shimin Tao
...
Xinhua Yang
Weibin Meng
Yuming Xie
Boxing Chen
Hao Yang
90
3
0
02 Dec 2024
Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution
Haiquan Wang
Chaoyi Ruan
Jia He
Jiaqi Ruan
Chengjie Tang
Xiaosong Ma
Cheng-rong Li
73
1
0
24 Nov 2024
Photon: Federated LLM Pre-Training
Lorenzo Sani
Alex Iacob
Zeyu Cao
Royson Lee
Bill Marino
...
Dongqi Cai
Zexi Li
Wanru Zhao
Xinchi Qiu
Nicholas D. Lane
AI4CE
36
7
0
05 Nov 2024
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
Shri Kiran Srinivasan
Kalyan Saladi
Carole-Jean Wu
166
7
0
29 Oct 2024
TiMePReSt: Time and Memory Efficient Pipeline Parallel DNN Training with Removed Staleness
Ankita Dutta
Nabendu Chaki
Rajat K. De
29
0
0
18 Oct 2024
Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
Haoyang Li
Fangcheng Fu
Hao Ge
Sheng Lin
Xuanyu Wang
Jiawen Niu
Yijiao Wang
Hailin Zhang
Xiaonan Nie
Bin Cui
MoMe
41
2
0
17 Oct 2024
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng
Chi Zhang
Zilingfeng Ye
Xibin Wu
Wang Zhang
Ru Zhang
Size Zheng
Haibin Lin
Chuan Wu
AI4CE
39
71
0
28 Sep 2024
LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun
Zihan Yang
Changyue Liao
Yingtao Li
Fei Wu
Zeke Wang
60
1
0
02 Sep 2024
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
Wei An
Xiao Bi
Guanting Chen
Shanhuang Chen
Chengqi Deng
...
Chenggang Zhao
Yao Zhao
Shangyan Zhou
Shunfeng Zhou
Yuheng Zou
41
6
0
26 Aug 2024
Cross-Domain Foundation Model Adaptation: Pioneering Computer Vision Models for Geophysical Data Analysis
Zhixiang Guo
Xinming Wu
Luming Liang
Hanlin Sheng
Nuo Chen
Zhengfa Bi
AI4CE
57
1
0
22 Aug 2024
Demystifying the Communication Characteristics for Distributed Transformer Models
Quentin G. Anthony
Benjamin Michalowicz
Jacob Hatef
Lang Xu
Mustafa Abduljabbar
Hari Subramoni
Hari Subramoni
D. Panda
AI4CE
38
2
0
19 Aug 2024
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Weiqi Feng
Yangrui Chen
Shaoyu Wang
Size Zheng
Haibin Lin
Minlan Yu
MLLM
AI4CE
42
4
0
07 Aug 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Peng Sun
73
8
0
29 Jul 2024
LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models
Achintya Gopal
Nicholas Wai Long Lau
Eva Adelina Susanto
Chi Lok Yu
Aditya Paul
ELM
25
7
0
27 Jul 2024
Enabling Elastic Model Serving with MultiWorld
Myungjin Lee
Akshay Jajoo
Ramana Rao Kompella
MoE
68
0
0
12 Jul 2024
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
Guanqiao Qu
Qiyuan Chen
Wei Wei
Zheng Lin
Xianhao Chen
Kaibin Huang
42
43
0
09 Jul 2024
The infrastructure powering IBM's Gen AI model development
Talia Gershon
Seetharami R. Seelam
Brian M. Belgodere
Milton Bonilla
Lan Hoang
...
Ruchir Puri
Dakshi Agrawal
Drew Thorstensen
Joel Belog
Brent Tang
VLM
40
5
0
07 Jul 2024
A Survey on Failure Analysis and Fault Injection in AI Systems
Guangba Yu
Gou Tan
Haojia Huang
Zhenyu Zhang
Pengfei Chen
Roberto Natella
Zibin Zheng
36
3
0
28 Jun 2024
Optimizing Large Model Training through Overlapped Activation Recomputation
Ping Chen
Wenjie Zhang
Shuibing He
Yingjie Gu
Zhuwei Peng
...
Yi Zheng
Zhefeng Wang
Yanlong Yin
Gang Chen
Gang Chen
35
5
0
13 Jun 2024
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Li-Wen Chang
Yiyuan Ma
Qi Hou
Chengquan Jiang
Ningxin Zheng
...
Zuquan Song
Ziheng Jiang
Yanghua Peng
Xuanzhe Liu
Xin Liu
41
22
0
11 Jun 2024
Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
Ray Cao
Sherry Luo
Steve Gan
Sujeeth Jinesh
18
0
0
08 Jun 2024
PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models
Jiannan Wang
Jiarui Fang
Aoyu Li
PengCheng Yang
AI4CE
64
3
0
23 May 2024
SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures
Swapnil Gandhi
Mark Zhao
Athinagoras Skiadopoulos
Christos Kozyrakis
AI4CE
GNN
49
1
0
22 May 2024
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models
Zhaojian Yu
Yinghao Wu
Zhuotao Deng
Yansong Tang
Xiao-Ping Zhang
52
2
0
21 May 2024
USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
Jiarui Fang
Shangchun Zhao
40
15
0
13 May 2024
Characterization of Large Language Model Development in the Datacenter
Qi Hu
Zhisheng Ye
Zerui Wang
Guoteng Wang
Mengdie Zhang
...
Dahua Lin
Xiaolin Wang
Yingwei Luo
Yonggang Wen
Tianwei Zhang
56
43
0
12 Mar 2024
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
245
1,826
0
17 Sep 2019
1