ResearchTrend.AI
Effective Theory of Transformers at Initialization

4 April 2023
Emily Dinan
Sho Yaida
Susan Zhang
arXiv:2304.02034

Papers cited by "Effective Theory of Transformers at Initialization"

45 / 45 papers shown
Meta-Principled Family of Hyperparameter Scaling Strategies
Sho Yaida
111
16
0
10 Oct 2022
OPT: Open Pre-trained Transformer Language Models
Susan Zhang
Stephen Roller
Naman Goyal
Mikel Artetxe
Moya Chen
...
Daniel Simig
Punit Singh Koura
Anjali Sridhar
Tianlu Wang
Luke Zettlemoyer
VLM OSLM AI4CE
377
3,700
0
02 May 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM LRM
537
6,301
0
05 Apr 2022
Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion
Kurt Shuster
M. Komeili
Leonard Adolphs
Stephen Roller
Arthur Szlam
Jason Weston
KELM
101
128
0
24 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang
J. E. Hu
Igor Babuschkin
Szymon Sidor
Xiaodong Liu
David Farhi
Nick Ryder
J. Pachocki
Weizhu Chen
Jianfeng Gao
114
168
0
07 Mar 2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith
M. Patwary
Brandon Norick
P. LeGresley
Samyam Rajbhandari
...
Mohammad Shoeybi
Yuxiong He
Michael Houston
Saurabh Tiwary
Bryan Catanzaro
MoE
163
743
0
28 Jan 2022
A ConvNet for the 2020s
Zhuang Liu
Hanzi Mao
Chao-Yuan Wu
Christoph Feichtenhofer
Trevor Darrell
Saining Xie
ViT
193
5,226
0
10 Jan 2022
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae
Sebastian Borgeaud
Trevor Cai
Katie Millican
Jordan Hoffmann
...
Jeff Stanway
L. Bennett
Demis Hassabis
Koray Kavukcuoglu
G. Irving
143
1,326
0
08 Dec 2021
Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications
Darshil Doshi
Tianyu He
Andrey Gromov
69
10
0
23 Nov 2021
Early Convolutions Help Transformers See Better
Tete Xiao
Mannat Singh
Eric Mintun
Trevor Darrell
Piotr Dollár
Ross B. Girshick
72
774
0
28 Jun 2021
The Principles of Deep Learning Theory
Daniel A. Roberts
Sho Yaida
Boris Hanin
FaML PINN GNN
78
246
0
18 Jun 2021
Scaling Vision Transformers
Xiaohua Zhai
Alexander Kolesnikov
N. Houlsby
Lucas Beyer
ViT
150
1,096
0
08 Jun 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
684
41,563
0
22 Oct 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
908
42,520
0
28 May 2020
Recipes for building an open-domain chatbot
Stephen Roller
Emily Dinan
Naman Goyal
Da Ju
Mary Williamson
...
Myle Ott
Kurt Shuster
Eric Michael Smith
Y-Lan Boureau
Jason Weston
ALM
127
1,015
0
28 Apr 2020
GLU Variants Improve Transformer
Noam M. Shazeer
156
1,024
0
12 Feb 2020
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau
Kartikay Khandelwal
Naman Goyal
Vishrav Chaudhary
Guillaume Wenzek
Francisco Guzmán
Edouard Grave
Myle Ott
Luke Zettlemoyer
Veselin Stoyanov
228
6,598
0
05 Nov 2019
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
M. Lewis
Yinhan Liu
Naman Goyal
Marjan Ghazvininejad
Abdel-rahman Mohamed
Omer Levy
Veselin Stoyanov
Luke Zettlemoyer
AIMat VLM
266
10,880
0
29 Oct 2019
RandAugment: Practical automated data augmentation with a reduced search space
E. D. Cubuk
Barret Zoph
Jonathon Shlens
Quoc V. Le
MQ
278
3,508
0
30 Sep 2019
Asymptotics of Wide Networks from Feynman Diagrams
Ethan Dyer
Guy Gur-Ari
98
115
0
25 Sep 2019
Finite Depth and Width Corrections to the Neural Tangent Kernel
Boris Hanin
Mihai Nica
MDE
87
152
0
13 Sep 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
703
24,572
0
26 Jul 2019
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Sangdoo Yun
Dongyoon Han
Seong Joon Oh
Sanghyuk Chun
Junsuk Choe
Y. Yoo
OOD
629
4,814
0
13 May 2019
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
Myle Ott
Sergey Edunov
Alexei Baevski
Angela Fan
Sam Gross
Nathan Ng
David Grangier
Michael Auli
VLM FaML
132
3,159
0
01 Apr 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM SSL SSeg
1.8K
95,324
0
11 Oct 2018
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Arthur Jacot
Franck Gabriel
Clément Hongler
284
3,225
0
20 Jun 2018
A Simple Method for Commonsense Reasoning
Trieu H. Trinh
Quoc V. Le
LRM ReLM
102
434
0
07 Jun 2018
Gaussian Process Behaviour in Wide Deep Neural Networks
A. G. Matthews
Mark Rowland
Jiri Hron
Richard Turner
Zoubin Ghahramani
BDL
168
561
0
30 Apr 2018
Deep Neural Networks as Gaussian Processes
Jaehoon Lee
Yasaman Bahri
Roman Novak
S. Schoenholz
Jeffrey Pennington
Jascha Narain Sohl-Dickstein
UQCV BDL
141
1,100
0
01 Nov 2017
mixup: Beyond Empirical Risk Minimization
Hongyi Zhang
Moustapha Cissé
Yann N. Dauphin
David Lopez-Paz
NoLa
318
9,815
0
25 Oct 2017
Mixed Precision Training
Paulius Micikevicius
Sharan Narang
Jonah Alben
G. Diamos
Erich Elsen
...
Boris Ginsburg
Michael Houston
Oleksii Kuchaiev
Ganesh Venkatesh
Hao Wu
180
1,806
0
10 Oct 2017
Random Erasing Data Augmentation
Zhun Zhong
Liang Zheng
Guoliang Kang
Shaozi Li
Yi Yang
116
3,652
0
16 Aug 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
819
132,725
0
12 Jun 2017
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal
Piotr Dollár
Ross B. Girshick
P. Noordhuis
Lukasz Wesolowski
Aapo Kyrola
Andrew Tulloch
Yangqing Jia
Kaiming He
3DH
132
3,688
0
08 Jun 2017
Deep Information Propagation
S. Schoenholz
Justin Gilmer
Surya Ganguli
Jascha Narain Sohl-Dickstein
92
371
0
04 Nov 2016
Using the Output Embedding to Improve Language Models
Ofir Press
Lior Wolf
104
738
0
20 Aug 2016
SGDR: Stochastic Gradient Descent with Warm Restarts
I. Loshchilov
Frank Hutter
ODL
356
8,190
0
13 Aug 2016
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
437
10,548
0
21 Jul 2016
Exponential expressivity in deep neural networks through transient chaos
Ben Poole
Subhaneil Lahiri
M. Raghu
Jascha Narain Sohl-Dickstein
Surya Ganguli
100
596
0
16 Jun 2016
On the Expressive Power of Deep Neural Networks
M. Raghu
Ben Poole
Jon M. Kleinberg
Surya Ganguli
Jascha Narain Sohl-Dickstein
84
791
0
16 Jun 2016
Deep Networks with Stochastic Depth
Gao Huang
Yu Sun
Zhuang Liu
Daniel Sedra
Kilian Q. Weinberger
217
2,365
0
30 Mar 2016
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy
Vincent Vanhoucke
Sergey Ioffe
Jonathon Shlens
Z. Wojna
3DV BDL
886
27,444
0
02 Dec 2015
Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Yukun Zhu
Ryan Kiros
R. Zemel
Ruslan Salakhutdinov
R. Urtasun
Antonio Torralba
Sanja Fidler
142
2,555
0
22 Jun 2015
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
2.1K
150,433
0
22 Dec 2014
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky
Jia Deng
Hao Su
J. Krause
S. Satheesh
...
A. Karpathy
A. Khosla
Michael S. Bernstein
Alexander C. Berg
Li Fei-Fei
VLM ObjD
1.7K
39,637
0
01 Sep 2014