Spike No More: Stabilizing the Pre-training of Large Language Models

28 December 2023

Sosuke Kobayashi

Papers citing "Spike No More: Stabilizing the Pre-training of Large Language Models"

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam Tianjin Huang Haotian Hu Zhenyu Zhang Gaojie Jin Xianrui Li ... Tianlong Chen Lu Liu Qingsong Wen Zhangyang Wang Shiwei Liu MQ 54 1 0 24 Feb 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training Tianjin Huang Ziquan Zhu Gaojie Jin Lu Liu Zhangyang Wang Shiwei Liu 63 3 0 12 Jan 2025
Small-scale proxies for large-scale Transformer training instabilities Mitchell Wortsman Peter J. Liu Lechao Xiao Katie Everett A. Alemi ... Jascha Narain Sohl-Dickstein Kelvin Xu Jaehoon Lee Justin Gilmer Simon Kornblith 54 93 0 25 Sep 2023
What Language Model to Train if You Have One Million GPU Hours? Teven Le Scao Thomas Wang Daniel Hesslow Lucile Saulnier Stas Bekman ... Lintang Sutawika Jaesung Tae Zheng-Xin Yong Julien Launay Iz Beltagy MoE AI4CE 243 105 0 27 Oct 2022
8-bit Optimizers via Block-wise Quantization Tim Dettmers M. Lewis Sam Shleifer Luke Zettlemoyer MQ 90 286 0 06 Oct 2021
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models Conglong Li Minjia Zhang Yuxiong He 33 38 0 13 Aug 2021
How to Train BERT with an Academic Budget Peter Izsak Moshe Berchansky Omer Levy 71 116 0 15 Apr 2021
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM Deepak Narayanan Mohammad Shoeybi Jared Casper P. LeGresley M. Patwary ... Prethvi Kashinkunti J. Bernauer Bryan Catanzaro Amar Phanishayee Matei A. Zaharia MoE 49 667 0 09 Apr 2021
On Layer Normalization in the Transformer Architecture Ruibin Xiong Yunchang Yang Di He Kai Zheng Shuxin Zheng Chen Xing Huishuai Zhang Yanyan Lan Liwei Wang Tie-Yan Liu AI4CE 74 973 0 12 Feb 2020
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 368 4,662 0 23 Jan 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism Mohammad Shoeybi M. Patwary Raul Puri P. LeGresley Jared Casper Bryan Catanzaro MoE 276 1,861 0 17 Sep 2019
Scaling Neural Machine Translation Myle Ott Sergey Edunov David Grangier Michael Auli AIMat 138 611 0 01 Jun 2018
A Call for Clarity in Reporting BLEU Scores Matt Post 73 2,941 0 23 Apr 2018
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling Hakan Inan Khashayar Khosravi R. Socher 76 384 0 04 Nov 2016
Using the Output Embedding to Improve Language Models Ofir Press Lior Wolf 46 731 0 20 Aug 2016
Neural Machine Translation of Rare Words with Subword Units Rico Sennrich Barry Haddow Alexandra Birch 131 7,683 0 31 Aug 2015
Training Very Deep Networks R. Srivastava Klaus Greff Jürgen Schmidhuber 77 1,675 0 22 Jul 2015
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Kaiming He Xinming Zhang Shaoqing Ren Jian Sun VLM 95 18,534 0 06 Feb 2015
