Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models
J. Lamy-Poirier · 4 June 2021 · arXiv:2106.02679 · Tags: MoE

Papers citing "Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models" (5 of 5 papers shown)

PosterLlama: Bridging Design Ability of Langauge Model to Contents-Aware Layout Generation
Jaejung Seol, Seojun Kim, Jaejun Yoo · Tags: 3DV, VLM · 36 / 7 / 0 · 01 Apr 2024

ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyang Yang, Minjia Zhang, Dong Li, Yuxiong He · Tags: MoE · 177 / 416 / 0 · 18 Jan 2021

Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed · Tags: VLM · 285 / 2,017 / 0 · 28 Jul 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei · 264 / 4,489 / 0 · 23 Jan 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro · Tags: MoE · 245 / 1,826 / 0 · 17 Sep 2019