ResearchTrend.AI
Spike No More: Stabilizing the Pre-training of Large Language Models

28 December 2023
Sho Takase
Shun Kiyono
Sosuke Kobayashi
Jun Suzuki
arXiv: 2312.16903

Papers citing "Spike No More: Stabilizing the Pre-training of Large Language Models"

18 / 18 papers shown
Title
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang
Haotian Hu
Zhenyu Zhang
Gaojie Jin
Xianrui Li
...
Tianlong Chen
Lu Liu
Qingsong Wen
Zhangyang Wang
Shiwei Liu
MQ
64
1
0
24 Feb 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Tianjin Huang
Ziquan Zhu
Gaojie Jin
Lu Liu
Zhangyang Wang
Shiwei Liu
65
3
0
12 Jan 2025
Small-scale proxies for large-scale Transformer training instabilities
Mitchell Wortsman
Peter J. Liu
Lechao Xiao
Katie Everett
A. Alemi
...
Jascha Narain Sohl-Dickstein
Kelvin Xu
Jaehoon Lee
Justin Gilmer
Simon Kornblith
54
93
0
25 Sep 2023
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao
Thomas Wang
Daniel Hesslow
Lucile Saulnier
Stas Bekman
...
Lintang Sutawika
Jaesung Tae
Zheng-Xin Yong
Julien Launay
Iz Beltagy
MoE
AI4CE
245
105
0
27 Oct 2022
8-bit Optimizers via Block-wise Quantization
Tim Dettmers
M. Lewis
Sam Shleifer
Luke Zettlemoyer
MQ
92
286
0
06 Oct 2021
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
Conglong Li
Minjia Zhang
Yuxiong He
36
38
0
13 Aug 2021
How to Train BERT with an Academic Budget
Peter Izsak
Moshe Berchansky
Omer Levy
77
116
0
15 Apr 2021
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Deepak Narayanan
Mohammad Shoeybi
Jared Casper
P. LeGresley
M. Patwary
...
Prethvi Kashinkunti
J. Bernauer
Bryan Catanzaro
Amar Phanishayee
Matei A. Zaharia
MoE
66
667
0
09 Apr 2021
On Layer Normalization in the Transformer Architecture
Ruibin Xiong
Yunchang Yang
Di He
Kai Zheng
Shuxin Zheng
Chen Xing
Huishuai Zhang
Yanyan Lan
Liwei Wang
Tie-Yan Liu
AI4CE
80
973
0
12 Feb 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
435
4,662
0
23 Jan 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
281
1,861
0
17 Sep 2019
Scaling Neural Machine Translation
Myle Ott
Sergey Edunov
David Grangier
Michael Auli
AIMat
149
611
0
01 Jun 2018
A Call for Clarity in Reporting BLEU Scores
Matt Post
81
2,941
0
23 Apr 2018
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
Hakan Inan
Khashayar Khosravi
R. Socher
80
384
0
04 Nov 2016
Using the Output Embedding to Improve Language Models
Ofir Press
Lior Wolf
51
731
0
20 Aug 2016
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich
Barry Haddow
Alexandra Birch
149
7,683
0
31 Aug 2015
Training Very Deep Networks
R. Srivastava
Klaus Greff
Jürgen Schmidhuber
93
1,675
0
22 Jul 2015
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
VLM
151
18,534
0
06 Feb 2015