
Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training
Papers citing "Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training"
- Mixed Precision Training (Paulius Micikevicius, Sharan Narang, Jonah Alben, G. Diamos, Erich Elsen, ..., Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu)