ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2108.06084
  4. Cited By
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup
  for Training GPT Models

The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

13 August 2021
Conglong Li
Minjia Zhang
Yuxiong He
ArXivPDFHTML

Papers citing "The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models"

41 / 41 papers shown
Title
OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
Juntao Zhao
Qi Lu
Wei Jia
Borui Wan
Lei Zuo
...
Size Zheng
Yanghua Peng
H. Lin
Xin Liu
Chuan Wu
AI4CE
92
0
0
14 Apr 2025
Attention Condensation via Sparsity Induced Regularized Training
Eli Sason
Darya Frolova
Boris Nazarov
Felix Goldberd
407
0
0
03 Mar 2025
PaLM: Scaling Language Modeling with Pathways
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
355
6,132
0
05 Apr 2022
Training Compute-Optimal Large Language Models
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
146
1,915
0
29 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot
  Hyperparameter Transfer
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang
J. E. Hu
Igor Babuschkin
Szymon Sidor
Xiaodong Liu
David Farhi
Nick Ryder
J. Pachocki
Weizhu Chen
Jianfeng Gao
66
155
0
07 Mar 2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A
  Large-Scale Generative Language Model
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith
M. Patwary
Brandon Norick
P. LeGresley
Samyam Rajbhandari
...
Mohammad Shoeybi
Yuxiong He
Michael Houston
Saurabh Tiwary
Bryan Catanzaro
MoE
141
737
0
28 Jan 2022
Curriculum learning for language modeling
Curriculum learning for language modeling
Daniel Fernando Campos
35
32
0
04 Aug 2021
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training
  with LAMB's Convergence Speed
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
Conglong Li
A. A. Awan
Hanlin Tang
Samyam Rajbhandari
Yuxiong He
55
33
0
13 Apr 2021
1-bit Adam: Communication Efficient Large-Scale Training with Adam's
  Convergence Speed
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
Hanlin Tang
Shaoduo Gan
A. A. Awan
Samyam Rajbhandari
Conglong Li
Xiangru Lian
Ji Liu
Ce Zhang
Yuxiong He
AI4CE
58
85
0
04 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
404
2,051
0
31 Dec 2020
Shortformer: Better Language Modeling using Shorter Inputs
Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press
Noah A. Smith
M. Lewis
253
89
0
31 Dec 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
546
41,106
0
28 May 2020
How Can We Accelerate Progress Towards Human-like Linguistic
  Generalization?
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Tal Linzen
253
194
0
03 May 2020
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
485
4,662
0
23 Jan 2020
PIQA: Reasoning about Physical Commonsense in Natural Language
PIQA: Reasoning about Physical Commonsense in Natural Language
Yonatan Bisk
Rowan Zellers
Ronan Le Bras
Jianfeng Gao
Yejin Choi
OOD
LRM
103
1,724
0
26 Nov 2019
Adversarial NLI: A New Benchmark for Natural Language Understanding
Adversarial NLI: A New Benchmark for Natural Language Understanding
Yixin Nie
Adina Williams
Emily Dinan
Joey Tianyi Zhou
Jason Weston
Douwe Kiela
101
991
0
31 Oct 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
337
19,824
0
23 Oct 2019
Megatron-LM: Training Multi-Billion Parameter Language Models Using
  Model Parallelism
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
303
1,861
0
17 Sep 2019
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi
Ronan Le Bras
Chandra Bhagavatula
Yejin Choi
62
211
0
24 Jul 2019
Defending Against Neural Fake News
Defending Against Neural Fake News
Rowan Zellers
Ari Holtzman
Hannah Rashkin
Yonatan Bisk
Ali Farhadi
Franziska Roesner
Yejin Choi
AAML
108
1,019
0
29 May 2019
Simple and Effective Curriculum Pointer-Generator Networks for Reading
  Comprehension over Long Narratives
Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives
Yi Tay
Shuohang Wang
Anh Tuan Luu
Jie Fu
Minh C. Phan
Xingdi Yuan
J. Rao
S. Hui
Aston Zhang
75
109
0
26 May 2019
HellaSwag: Can a Machine Really Finish Your Sentence?
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers
Ari Holtzman
Yonatan Bisk
Ali Farhadi
Yejin Choi
118
2,373
0
19 May 2019
Curriculum Learning for Domain Adaptation in Neural Machine Translation
Curriculum Learning for Domain Adaptation in Neural Machine Translation
Xuan Zhang
Pamela Shapiro
Manish Kumar
Paul McNamee
Marine Carpuat
Kevin Duh
55
124
0
14 May 2019
Control Regularization for Reduced Variance Reinforcement Learning
Control Regularization for Reduced Variance Reinforcement Learning
Richard Cheng
Abhinav Verma
G. Orosz
Swarat Chaudhuri
Yisong Yue
J. W. Burdick
OffRL
59
77
0
14 May 2019
SuperGLUE: A Stickier Benchmark for General-Purpose Language
  Understanding Systems
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Alex Jinpeng Wang
Yada Pruksachatkun
Nikita Nangia
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
205
2,296
0
02 May 2019
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You
Jing Li
Sashank J. Reddi
Jonathan Hseu
Sanjiv Kumar
Srinadh Bhojanapalli
Xiaodan Song
J. Demmel
Kurt Keutzer
Cho-Jui Hsieh
ODL
186
991
0
01 Apr 2019
Competence-based Curriculum Learning for Neural Machine Translation
Competence-based Curriculum Learning for Neural Machine Translation
Emmanouil Antonios Platanios
Otilia Stretcu
Graham Neubig
Barnabás Póczós
Tom Michael Mitchell
80
340
0
23 Mar 2019
An Empirical Exploration of Curriculum Learning for Neural Machine
  Translation
An Empirical Exploration of Curriculum Learning for Neural Machine Translation
Xuan Zhang
Manish Kumar
Huda Khayrallah
Kenton W. Murray
Jeremy Gwinnup
Marianna J. Martindale
Paul McNamee
Kevin Duh
Marine Carpuat
ODL
64
110
0
02 Nov 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.2K
93,936
0
11 Oct 2018
Variance Reduction for Reinforcement Learning in Input-Driven
  Environments
Variance Reduction for Reinforcement Learning in Input-Driven Environments
Hongzi Mao
S. Venkatakrishnan
Malte Schwarzkopf
Mohammad Alizadeh
OffRL
65
95
0
06 Jul 2018
A Simple Method for Commonsense Reasoning
A Simple Method for Commonsense Reasoning
Trieu H. Trinh
Quoc V. Le
LRM
ReLM
89
432
0
07 Jun 2018
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning
  Challenge
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ELM
RALM
LRM
113
2,474
0
14 Mar 2018
Don't Decay the Learning Rate, Increase the Batch Size
Don't Decay the Learning Rate, Increase the Batch Size
Samuel L. Smith
Pieter-Jan Kindermans
Chris Ying
Quoc V. Le
ODL
95
990
0
01 Nov 2017
Mixed Precision Training
Mixed Precision Training
Paulius Micikevicius
Sharan Narang
Jonah Alben
G. Diamos
Erich Elsen
...
Boris Ginsburg
Michael Houston
Oleksii Kuchaiev
Ganesh Venkatesh
Hao Wu
143
1,779
0
10 Oct 2017
Curriculum Learning and Minibatch Bucketing in Neural Machine
  Translation
Curriculum Learning and Minibatch Bucketing in Neural Machine Translation
Tom Kocmi
Ondrej Bojar
40
140
0
29 Jul 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
524
129,831
0
12 Jun 2017
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for
  Reading Comprehension
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi
Eunsol Choi
Daniel S. Weld
Luke Zettlemoyer
RALM
183
2,610
0
09 May 2017
RACE: Large-scale ReAding Comprehension Dataset From Examinations
RACE: Large-scale ReAding Comprehension Dataset From Examinations
Guokun Lai
Qizhe Xie
Hanxiao Liu
Yiming Yang
Eduard H. Hovy
ELM
148
1,329
0
15 Apr 2017
Averaged-DQN: Variance Reduction and Stabilization for Deep
  Reinforcement Learning
Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning
Oron Anschel
Nir Baram
N. Shimkin
60
315
0
07 Nov 2016
The LAMBADA dataset: Word prediction requiring a broad discourse context
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno
Germán Kruszewski
Angeliki Lazaridou
Q. N. Pham
Raffaella Bernardi
Sandro Pezzelle
Marco Baroni
Gemma Boleda
Raquel Fernández
106
698
0
20 Jun 2016
Adam: A Method for Stochastic Optimization
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
1.2K
149,474
0
22 Dec 2014
1