ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

9 April 2021
Deepak Narayanan, M. Shoeybi, Jared Casper, P. LeGresley, M. Patwary, V. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia
[MoE]

Papers citing "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM"

50 / 366 papers shown
Retrieval-augmented code completion for local projects using large language models (09 Aug 2024)
  Marko Hostnik, Marko Robnik-Sikonja [RALM]

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10 (09 Aug 2024)
  Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang [GNN]

Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness (08 Aug 2024)
  Xiaojing Fan, Chunliang Tao [AAML]

Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation (07 Aug 2024)
  Weiqi Feng, Yangrui Chen, Shaoyu Wang, Size Zheng, Haibin Lin, Minlan Yu [MLLM, AI4CE]

UnifiedNN: Efficient Neural Network Training on the Cloud (02 Aug 2024)
  Xingyu Lou, Arthi Padmanabhan, Spyridon Mastorakis [FedML]
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey (29 Jul 2024)
  Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, …, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development (29 Jul 2024)
  Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, …, Yanghua Peng, Xin Liu, Chuan Wu

u-μP: The Unit-Scaled Maximal Update Parametrization (24 Jul 2024)
  Charlie Blake, C. Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Bjorn Deiseroth, Andres Felipe Cruz Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr
MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training (22 Jul 2024)
  Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, A. Anandkumar

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference (19 Jul 2024)
  Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

Integrated Hardware Architecture and Device Placement Search (18 Jul 2024)
  Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan

Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis (13 Jul 2024)
  Xiuying Wei, Skander Moalla, Razvan Pascanu, Çağlar Gülçehre

SoupLM: Model Integration in Large Language and Multi-Modal Models (11 Jul 2024)
  Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu [MoMe]
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey (09 Jul 2024)
  Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang

The infrastructure powering IBM's Gen AI model development (07 Jul 2024)
  Talia Gershon, Seetharami R. Seelam, Brian M. Belgodere, Milton Bonilla, Lan Hoang, …, Ruchir Puri, Dakshi Agrawal, Drew Thorstensen, Joel Belog, Brent Tang [VLM]

LoCo: Low-Bit Communication Adaptor for Large-scale Model Training (05 Jul 2024)
  Xingyu Xie, Zhijie Lin, Kim-Chuan Toh, Pan Zhou

On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers (02 Jul 2024)
  Zhengxian Lu, Fangyu Wang, Zhiwei Xu, Fei Yang, Tao Li
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models (02 Jul 2024)
  Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

Memory³: Language Modeling with Explicit Memory (01 Jul 2024)
  Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, …, Yu Yu, Kai Chen, Feiyu Xiong, Linpeng Tang, Weinan E

WallFacer: Guiding Transformer Model Training Out of the Long-Context Dark Forest with N-body Problem (30 Jun 2024)
  Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Xuanlei Zhao, James Demmel, Yang You

Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules (30 Jun 2024)
  Xinglin Pan, Wenxiang Lin, S. Shi, Xiaowen Chu, Weinong Sun, Bo Li [MoE]
FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models (28 Jun 2024)
  Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training (27 Jun 2024)
  Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang [OffRL]

On Scaling Up 3D Gaussian Splatting Training (26 Jun 2024)
  Hexu Zhao, Haoyang Weng, Daohan Lu, Ang Li, Jinyang Li, Aurojit Panda, Saining Xie [3DGS]

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism (24 Jun 2024)
  Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, …, Xupeng Miao, Mohammad Alizadeh, G. R. Ganger, Tianqi Chen, Zhihao Jia [GNN, AI4CE]
Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars (24 Jun 2024)
  Wesley Brewer, Aditya Kashi, Sajal Dash, A. Tsaris, Junqi Yin, Mallikarjun Shankar, Feiyi Wang

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers (24 Jun 2024)
  Xiuying Wei, Skander Moalla, Razvan Pascanu, Çağlar Gülçehre

AI-coupled HPC Workflow Applications, Middleware and Performance (20 Jun 2024)
  Wes Brewer, Ana Gainaru, Frédéric Suter, Feiyi Wang, M. Emani, S. Jha

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving (19 Jun 2024)
  Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang

LiLiuM: eBay's Large Language Models for e-commerce (17 Jun 2024)
  Christian Herold, Michael Kozielski, Leonid Ekimov, Pavel Petrushkov, P. Vandenbussche, Shahram Khadivi
Nemotron-4 340B Technical Report (17 Jun 2024)
  NVIDIA: Bo Adler, Niket Agarwal, Ashwath Aithal, …, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, Chen Zhu

Optimizing Large Model Training through Overlapped Activation Recomputation (13 Jun 2024)
  Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, …, Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey (12 Jun 2024)
  Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu

An Empirical Study of Mamba-based Language Models (12 Jun 2024)
  R. Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, V. Korthikanti, …, Vartika Singh, Jared Casper, Jan Kautz, M. Shoeybi, Bryan Catanzaro
Sustainable self-supervised learning for speech representations (11 Jun 2024)
  Luis Lugo, Valentin Vielzeuf

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising (11 Jun 2024)
  Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion (11 Jun 2024)
  Li-Wen Chang, Yiyuan Ma, Qi Hou, Chengquan Jiang, Ningxin Zheng, …, Zuquan Song, Ziheng Jiang, Yanghua Peng, Xuanzhe Liu, Xin Liu

PALM: A Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training (06 Jun 2024)
  Jiahao Fang, Huizheng Wang, Qize Yang, Dehao Kong, Xu Dai, Jinyi Deng, Yang Hu, Shouyi Yin

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training (05 Jun 2024)
  Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun
Llumnix: Dynamic Scheduling for Large Language Model Serving (05 Jun 2024)
  Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models (03 Jun 2024)
  Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, …, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou [MoE]

Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots (29 May 2024)
  Zekai Sun, Xiuxian Guan, Junming Wang, Haoze Song, Yuhao Qing, Tianxiang Shen, Dong Huang, Fangming Liu, Heming Cui

Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters (28 May 2024)
  Jinkyu Yim, Jaeyong Song, Yerim Choi, Jaebeen Lee, Jaewon Jung, Hongsun Jang, Jinho Lee
Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference (27 May 2024)
  Shengyuan Ye, Jiangsu Du, Liekang Zeng, Wenzhong Ou, Xiaowen Chu, Yutong Lu, Xu Chen

Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization (26 May 2024)
  Amir Saeidi, Shivanshu Verma, Aswin Rrv, Chitta Baral

Pipeline Parallelism with Controllable Memory (24 May 2024)
  Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

A Survey of Distributed Learning in Cloud, Mobile, and Edge Settings (23 May 2024)
  Madison Threadgill, A. Gerstlauer

SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures (22 May 2024)
  Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis [AI4CE, GNN]

OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models (21 May 2024)
  Zhaojian Yu, Yinghao Wu, Zhuotao Deng, Yansong Tang, Xiao-Ping Zhang

Large Language Models for Education: A Survey (12 May 2024)
  Hanyi Xu, Wensheng Gan, Zhenlian Qi, Jiayang Wu, Philip S. Yu [AI4Ed, ELM]