ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
28 February 2025 (arXiv:2502.21231)
Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Yanghua Peng, Tengjiao Wang, Xin Liu
Papers citing "ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs" (13 papers):
OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, ..., Size Zheng, Yanghua Peng, H. Lin, Xin Liu, Chuan Wu (14 Apr 2025)
How to Train Long-Context Language Models (Effectively)
Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen (03 Oct 2024)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré (27 May 2022)
Reducing Activation Recomputation in Large Transformer Models
V. Korthikanti, Jared Casper, Sangkug Lym, Lawrence C. McAfee, M. Andersch, Mohammad Shoeybi, Bryan Catanzaro (10 May 2022)
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, ..., Kathy Meier-Hellstern, Douglas Eck, J. Dean, Slav Petrov, Noah Fiedel (05 Apr 2022)
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, P. LeGresley, M. Patwary, ..., Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia (09 Apr 2021)
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (28 May 2020)
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei (23 Jan 2020)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (04 Oct 2019)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro (17 Sep 2019)
Horovod: fast and easy distributed deep learning in TensorFlow
Alexander Sergeev, Mike Del Balso (15 Feb 2018)
Ray: A Distributed Framework for Emerging AI Applications
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, ..., Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica (16 Dec 2017)
Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin (12 Jun 2017)