ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2203.06211
  4. Cited By
Staged Training for Transformer Language Models

Staged Training for Transformer Language Models

11 March 2022
Sheng Shen
Pete Walsh
Kurt Keutzer
Jesse Dodge
Matthew E. Peters
Iz Beltagy
ArXivPDFHTML

Papers citing "Staged Training for Transformer Language Models"

22 / 22 papers shown
Title
STEP: Staged Parameter-Efficient Pre-training for Large Language Models
STEP: Staged Parameter-Efficient Pre-training for Large Language Models
Kazuki Yano
Takumi Ito
Jun Suzuki
LRM
85
1
0
05 Apr 2025
Stacking as Accelerated Gradient Descent
Stacking as Accelerated Gradient Descent
Naman Agarwal
Pranjal Awasthi
Satyen Kale
Eric Zhao
ODL
89
2
0
20 Feb 2025
FLM-101B: An Open LLM and How to Train It with $100K Budget
FLM-101B: An Open LLM and How to Train It with 100KBudget100K Budget100KBudget
Xiang Li
Yiqun Yao
Xin Jiang
Xuezhi Fang
Xuying Meng
...
Li Du
Bowen Qin
Zheng Zhang
Aixin Sun
Yequan Wang
75
22
0
07 Sep 2023
Training Compute-Optimal Large Language Models
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
116
1,894
0
29 Mar 2022
GradMax: Growing Neural Networks using Gradient Information
GradMax: Growing Neural Networks using Gradient Information
Utku Evci
B. V. Merrienboer
Thomas Unterthiner
Max Vladymyrov
Fabian Pedregosa
31
53
0
13 Jan 2022
Scaling Language Models: Methods, Analysis & Insights from Training
  Gopher
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae
Sebastian Borgeaud
Trevor Cai
Katie Millican
Jordan Hoffmann
...
Jeff Stanway
L. Bennett
Demis Hassabis
Koray Kavukcuoglu
G. Irving
48
1,303
0
08 Dec 2021
Shortformer: Better Language Modeling using Shorter Inputs
Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press
Noah A. Smith
M. Lewis
250
89
0
31 Dec 2020
On the Transformer Growth for Progressive BERT Training
On the Transformer Growth for Progressive BERT Training
Xiaotao Gu
Liyuan Liu
Hongkun Yu
Jing Li
Chong Chen
Jiawei Han
VLM
77
51
0
23 Oct 2020
Shallow-to-Deep Training for Neural Machine Translation
Shallow-to-Deep Training for Neural Machine Translation
Bei Li
Ziyang Wang
Hui Liu
Yufan Jiang
Quan Du
Tong Xiao
Huizhen Wang
Jingbo Zhu
22
49
0
08 Oct 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
458
41,106
0
28 May 2020
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
437
4,662
0
23 Jan 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
258
19,824
0
23 Oct 2019
Splitting Steepest Descent for Growing Neural Architectures
Splitting Steepest Descent for Growing Neural Architectures
Qiang Liu
Lemeng Wu
Dilin Wang
35
61
0
06 Oct 2019
Green AI
Green AI
Roy Schwartz
Jesse Dodge
Noah A. Smith
Oren Etzioni
80
1,124
0
22 Jul 2019
Deep contextualized word representations
Deep contextualized word representations
Matthew E. Peters
Mark Neumann
Mohit Iyyer
Matt Gardner
Christopher Clark
Kenton Lee
Luke Zettlemoyer
NAI
99
11,520
0
15 Feb 2018
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
427
129,831
0
12 Jun 2017
Pointer Sentinel Mixture Models
Pointer Sentinel Mixture Models
Stephen Merity
Caiming Xiong
James Bradbury
R. Socher
RALM
152
2,783
0
26 Sep 2016
The LAMBADA dataset: Word prediction requiring a broad discourse context
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno
Germán Kruszewski
Angeliki Lazaridou
Q. N. Pham
Raffaella Bernardi
Sandro Pezzelle
Marco Baroni
Gemma Boleda
Raquel Fernández
70
687
0
20 Jun 2016
Progressive Neural Networks
Progressive Neural Networks
Andrei A. Rusu
Neil C. Rabinowitz
Guillaume Desjardins
Hubert Soyer
J. Kirkpatrick
Koray Kavukcuoglu
Razvan Pascanu
R. Hadsell
CLL
AI4CE
43
2,428
0
15 Jun 2016
Network Morphism
Network Morphism
Tao Wei
Changhu Wang
Y. Rui
Chen Chen
61
176
0
05 Mar 2016
Net2Net: Accelerating Learning via Knowledge Transfer
Net2Net: Accelerating Learning via Knowledge Transfer
Tianqi Chen
Ian Goodfellow
Jonathon Shlens
88
663
0
18 Nov 2015
Adam: A Method for Stochastic Optimization
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
736
149,474
0
22 Dec 2014
1