DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models
arXiv:2505.22549 · 28 May 2025
Alex Iacob
Lorenzo Sani
M. Safaryan
Paris Giampouras
Samuel Horváth
Andrej Jovanovic
Meghdad Kurmanji
Preslav Aleksandrov
William F. Shen
Xinchi Qiu
Nicholas D. Lane
OffRL
Papers citing
"DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models"
41 papers shown
Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
Zachary B. Charles
Gabriel Teston
Lucio Dery
Keith Rush
Nova Fallen
Zachary Garrett
Arthur Szlam
Arthur Douillard
401
5
0
12 Mar 2025
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Loubna Ben Allal
Anton Lozhkov
Elie Bakouch
Gabriel Martín Blázquez
Guilherme Penedo
...
Cyril Zakka
Mathieu Morlon
Colin Raffel
Leandro von Werra
Thomas Wolf
MoE
101
40
0
04 Feb 2025
Photon: Federated LLM Pre-Training
Lorenzo Sani
Alex Iacob
Zeyu Cao
Royson Lee
Bill Marino
...
Dongqi Cai
Zexi Li
Wanru Zhao
Xinchi Qiu
Nicholas D. Lane
AI4CE
67
9
0
05 Nov 2024
How Does Critical Batch Size Scale in Pre-training?
Hanlin Zhang
Depen Morwani
Nikhil Vyas
Jingfeng Wu
Difan Zou
Udaya Ghai
Dean Phillips Foster
Sham Kakade
125
15
0
29 Oct 2024
DEPT: Decoupled Embeddings for Pre-training Language Models
Alex Iacob
Lorenzo Sani
Meghdad Kurmanji
William F. Shen
Xinchi Qiu
Dongqi Cai
Yan Gao
Nicholas D. Lane
VLM
527
1
0
07 Oct 2024
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini
Pierre Ablin
David Grangier
ODL
56
12
0
05 Sep 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlíček
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
105
243
0
25 Jun 2024
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Alexander Hägele
Elie Bakouch
Atli Kosson
Loubna Ben Allal
Leandro von Werra
Martin Jaggi
79
43
0
28 May 2024
The Future of Large Language Model Pre-training is Federated
Lorenzo Sani
Alexandru Iacob
Zeyu Cao
Bill Marino
Yan Gao
...
Wanru Zhao
William F. Shen
Preslav Aleksandrov
Xinchi Qiu
Nicholas D. Lane
AI4CE
110
19
0
17 May 2024
Asynchronous Local-SGD Training for Language Modeling
Bo Liu
Rachita Chhaparia
Arthur Douillard
Satyen Kale
Andrei A. Rusu
Jiajun Shen
Arthur Szlam
Marc'Aurelio Ranzato
FedML
57
11
0
17 Jan 2024
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Nikhil Sardana
Jacob P. Portes
Sasha Doubov
Jonathan Frankle
LRM
296
84
0
31 Dec 2023
DiLoCo: Distributed Low-Communication Training of Language Models
Arthur Douillard
Qixuang Feng
Andrei A. Rusu
Rachita Chhaparia
Yani Donchev
A. Kuncoro
Marc'Aurelio Ranzato
Arthur Szlam
Jiajun Shen
94
38
0
14 Nov 2023
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner
Jacques Chen
J. Lavington
Mark Schmidt
82
71
0
27 Apr 2023
Stable and low-precision training for large-scale vision-language models
Mitchell Wortsman
Tim Dettmers
Luke Zettlemoyer
Ari S. Morcos
Ali Farhadi
Ludwig Schmidt
MQ
MLLM
VLM
107
43
0
25 Apr 2023
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao
Andrew Gu
R. Varma
Liangchen Luo
Chien-chin Huang
...
Bernard Nguyen
Geeta Chauhan
Y. Hao
Ajit Mathews
Shen Li
FedML
MoE
86
341
0
21 Apr 2023
Symbolic Discovery of Optimization Algorithms
Xiangning Chen
Chen Liang
Da Huang
Esteban Real
Kaiyuan Wang
...
Xuanyi Dong
Thang Luong
Cho-Jui Hsieh
Yifeng Lu
Quoc V. Le
147
373
0
13 Feb 2023
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
379
2,385
0
09 Nov 2022
Revisiting Optimal Convergence Rate for Smooth and Non-convex Stochastic Decentralized Optimization
Kun Yuan
Xinmeng Huang
Yiming Chen
Xiaohan Zhang
Yingya Zhang
Pan Pan
49
21
0
14 Oct 2022
Adam Can Converge Without Any Modification On Update Rules
Yushun Zhang
Congliang Chen
Naichen Shi
Ruoyu Sun
Zhimin Luo
38
67
0
20 Aug 2022
On Distributed Adaptive Optimization with Gradient Compression
Xiaoyun Li
Belhal Karimi
Ping Li
46
27
0
11 May 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
471
6,231
0
05 Apr 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
197
1,946
0
29 Mar 2022
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su
Yu Lu
Shengfeng Pan
Ahmed Murtadha
Bo Wen
Yunfeng Liu
268
2,443
0
20 Apr 2021
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
743
41,932
0
28 May 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
599
4,801
0
23 Jan 2020
Advances and Open Problems in Federated Learning
Peter Kairouz
H. B. McMahan
Brendan Avent
A. Bellet
M. Bennis
...
Zheng Xu
Qiang Yang
Felix X. Yu
Han Yu
Sen Zhao
FedML
AI4CE
234
6,252
0
10 Dec 2019
Lower Bounds for Non-Convex Stochastic Optimization
Yossi Arjevani
Y. Carmon
John C. Duchi
Dylan J. Foster
Nathan Srebro
Blake E. Woodworth
71
358
0
05 Dec 2019
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk
Rowan Zellers
Ronan Le Bras
Jianfeng Gao
Yejin Choi
OOD
LRM
142
1,792
0
26 Nov 2019
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Samyam Rajbhandari
Jeff Rasley
Olatunji Ruwase
Yuxiong He
ALM
AI4CE
82
881
0
04 Oct 2019
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
326
1,899
0
17 Sep 2019
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers
Ari Holtzman
Yonatan Bisk
Ali Farhadi
Yejin Choi
168
2,468
0
19 May 2019
On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization
Hao Yu
Rong Jin
Sen Yang
FedML
89
384
0
09 May 2019
On the Convergence of Adam and Beyond
Sashank J. Reddi
Satyen Kale
Sanjiv Kumar
93
2,499
0
19 Apr 2019
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
ODL
230
996
0
01 Apr 2019
Local SGD Converges Fast and Communicates Little
Sebastian U. Stich
FedML
166
1,061
0
24 May 2018
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM
RALM
LRM
158
2,587
0
14 Mar 2018
Horovod: fast and easy distributed deep learning in TensorFlow
Alexander Sergeev
Mike Del Balso
97
1,221
0
15 Feb 2018
Don't Decay the Learning Rate, Increase the Batch Size
Samuel L. Smith
Pieter-Jan Kindermans
Chris Ying
Quoc V. Le
ODL
99
995
0
01 Nov 2017
Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Lukas Balles
Philipp Hennig
70
169
0
22 May 2017
Communication-Efficient Learning of Deep Networks from Decentralized Data
H. B. McMahan
Eider Moore
Daniel Ramage
S. Hampson
Blaise Agüera y Arcas
FedML
397
17,468
0
17 Feb 2016
On the difficulty of training Recurrent Neural Networks
Razvan Pascanu
Tomas Mikolov
Yoshua Bengio
ODL
190
5,342
0
21 Nov 2012