Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

9 April 2021
Deepak Narayanan, M. Shoeybi, Jared Casper, P. LeGresley, M. Patwary, V. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia
    MoE

Papers citing "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM"

50 / 366 papers shown
• Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
  Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
  30 Dec 2024
• Deploying Foundation Model Powered Agent Services: A Survey
  Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
  AI4CE
  18 Dec 2024
• Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs
  Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, ..., Guangxuan Xu, Kai Xu, Ligong Han, Luke Inglis, Akash Srivastava
  17 Dec 2024
• EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
  Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha
  OffRL
  10 Dec 2024
• Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models
  S. Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal
  02 Dec 2024
• Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge
  Yuhe Ji, Yilun Liu, Feiyu Yao, Minggui He, Shimin Tao, ..., Xinhua Yang, Weibin Meng, Yuming Xie, Boxing Chen, Hao Yang
  02 Dec 2024
• FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
  Y. Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui
  02 Dec 2024
• Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution
  Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng-rong Li
  24 Nov 2024
• Towards a Middleware for Large Language Models
  Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer
  21 Nov 2024
• MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
  Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu
  20 Nov 2024
• Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
  Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
  20 Nov 2024
• Computational metaoptics for imaging
  Charles Roques-Carmes, Kai Wang, Yanting Yang, A. Majumdar, Zin Lin
  14 Nov 2024
• Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
  Kazuki Fujii, Kohei Watanabe, Rio Yokota
  10 Nov 2024
• Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey
  Zhihong Liu, Xin Xu, Peng Qiao, Dongsheng Li
  OffRL
  08 Nov 2024
• Flashy Backdoor: Real-world Environment Backdoor Attack on SNNs with DVS Cameras
  Roberto Riaño, Gorka Abad, S. Picek, A. Urbieta
  AAML
  05 Nov 2024
• Photon: Federated LLM Pre-Training
  Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Bill Marino, ..., Dongqi Cai, Zexi Li, Wanru Zhao, Xinchi Qiu, Nicholas D. Lane
  AI4CE
  05 Nov 2024
• Context Parallelism for Scalable Million-Token Inference
  Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang
  MoE, LRM
  04 Nov 2024
• HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy
  Shuqing Luo, Jie Peng, Pingzhi Li, Tianlong Chen
  MoE
  02 Nov 2024
• Data movement limits to frontier model training
  Ege Erdil, David Schneider-Joseph
  02 Nov 2024
• Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
  Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, Khuzaima Daudjee
  01 Nov 2024
• MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffc-Aware Parallel Optimization
  J. Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li
  MoE
  01 Nov 2024
• SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
  Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa
  01 Nov 2024
• $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources
  Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick
  VLM, AI4CE, OnRL
  30 Oct 2024
• KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation
  Rambod Azimi, Rishav Rishav, M. Teichmann, Samira Ebrahimi Kahou
  ALM
  28 Oct 2024
• BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
  Houming Wu, Ling Chen, Wenjie Yu
  AI4CE
  25 Oct 2024
• Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization
  David Thulke, Yingbo Gao, Rricha Jalota, Christian Dugast, Hermann Ney
  24 Oct 2024
• Leveraging the Domain Adaptation of Retrieval Augmented Generation Models for Question Answering and Reducing Hallucination
  Salman Rakin, Md. A. R. Shibly, Zahin M. Hossain, Zeeshan Khan, Md. Mostofa Akbar
  23 Oct 2024
• Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
  Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Y. Wang, Hailin Zhang, Xiaonan Nie, Bin Cui
  MoMe
  17 Oct 2024
• FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
  Zhenheng Tang, Xueze Kang, Yiming Yin, Xinglin Pan, Yuxin Wang, ..., Shaohuai Shi, Amelie Chi Zhou, Bo Li, Bingsheng He, Xiaowen Chu
  AI4CE
  16 Oct 2024
• FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
  Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang
  16 Oct 2024
• AI, Climate, and Regulation: From Data Centers to the AI Act
  Kai Ebert, Nicolas Alder, Ralf Herbrich, Philipp Hacker
  AI4CE
  09 Oct 2024
• TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
  Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, ..., Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos
  OffRL
  09 Oct 2024
• LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences
  Zhenxiao Fu, Fan Chen, Shan Zhou, Haitong Li, Lei Jiang
  03 Oct 2024
• LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services
  Małgorzata Łazuka, Andreea Anghel, Thomas Parnell
  03 Oct 2024
• Comprehensive Performance Modeling and System Design Insights for Foundation Models
  Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, S. Farrell, Brian Austin, Samuel Williams, N. Wright, W. Bhimji
  30 Sep 2024
• HybridFlow: A Flexible and Efficient RLHF Framework
  Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Size Zheng, Haibin Lin, Chuan Wu
  AI4CE
  28 Sep 2024
• PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
  Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, G. R. Ganger, Yida Wang
  AI4CE
  23 Sep 2024
• Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
  Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase
  23 Sep 2024
• A Large Language Model and Denoising Diffusion Framework for Targeted Design of Microstructures with Commands in Natural Language
  Nikita Kartashov, Nikolaos N. Vlassis
  DiffM, AI4CE
  22 Sep 2024
• Drift to Remember
  Jin Du, Xiaotian Zhang, Hao Shen, Xun Xian, Ganghua Wang, Jiawei Zhang, Yuhong Yang, Na Li, Jia Liu, Jie Ding
  CLL
  21 Sep 2024
• Exploring Scaling Laws for Local SGD in Large Language Model Training
  Qiaozhi He, Xiaomin Zhuang, Zhihua Wu
  20 Sep 2024
• Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
  Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
  LRM, AI4CE
  19 Sep 2024
• Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
  Chelsea Maria John, Stepan Nassyr, Carolin Penke, A. Herten
  19 Sep 2024
• Achieving Peak Performance for Large Language Models: A Systematic Review
  Z. R. K. Rostam, Sándor Szénási, Gábor Kertész
  07 Sep 2024
• LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
  Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
  02 Sep 2024
• Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
  Wei An, Xiao Bi, Guanting Chen, Shanhuang Chen, Chengqi Deng, ..., Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Yuheng Zou
  26 Aug 2024
• Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
  WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai
  22 Aug 2024
• Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining
  Pihe Hu, Shaolong Li, Longbo Huang
  21 Aug 2024
• Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
  Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Yong Li, Wei Lin, Fangming Liu
  16 Aug 2024
• Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
  Shengyuan Ye, Liekang Zeng, Xiaowen Chu, Guoliang Xing, Xu Chen
  15 Aug 2024