On Layer Normalization in the Transformer Architecture

12 February 2020 · arXiv:2002.04745
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu · AI4CE
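
For context on the paper itself: it compares the original Post-LN Transformer layer, which applies layer normalization after each residual addition, with the Pre-LN variant, which normalizes at the start of each residual branch. The authors show that Pre-LN yields well-scaled gradients at initialization, so training can proceed without the learning-rate warmup stage that Post-LN requires. The sketch below contrasts the two layouts; it is a minimal PyTorch illustration, not the authors' code, and the module choices and dimensions (nn.MultiheadAttention, d_model=512, n_heads=8, d_ff=2048) are illustrative assumptions.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block with switchable LayerNorm placement.

    pre_ln=False reproduces the original Post-LN layout; pre_ln=True is
    the Pre-LN layout analyzed in the paper. Hyperparameters are
    illustrative, not taken from the paper.
    """

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, pre_ln=True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        if self.pre_ln:
            # Pre-LN: x + Sublayer(LN(x)); norm sits inside the residual branch.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.norm2(x))
        else:
            # Post-LN: LN(x + Sublayer(x)); the original Transformer layout.
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.ffn(x))
        return x
```

With pre_ln=True, stacks of such blocks can typically be trained from the first step without warmup, which is the paper's central claim; the Post-LN layout generally needs a warmup stage to avoid divergence at larger learning rates.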

Papers citing "On Layer Normalization in the Transformer Architecture"

Showing 50 of 566 citing papers, newest first.

Federated Learning with Dynamic Transformer for Text to Speech
Zhenhou Hong, Jianzong Wang, Xiaoyang Qu, Jie Liu, Chendong Zhao, Jing Xiao · FedML · 09 Jul 2021

Improved Language Identification Through Cross-Lingual Self-Supervised Learning
Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli · VLM, SSL · 08 Jul 2021

Long-Short Transformer: Efficient Transformers for Language and Vision
Chen Zhu, Ming-Yu Liu, Chaowei Xiao, M. Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro · ViT, VLM · 05 Jul 2021

Stabilizing Equilibrium Models by Jacobian Regularization
Shaojie Bai, V. Koltun, J. Zico Kolter · 28 Jun 2021

Self-Attentive Ensemble Transformer: Representing Ensemble Interactions in Neural Networks for Earth System Models
Tobias S. Finn · 21 Jun 2021

Multi-head or Single-head? An Empirical Comparison for Transformer Training
Liyuan Liu, Jialu Liu, Jiawei Han · 17 Jun 2021

Global Rhythm Style Transfer Without Text Transcriptions
Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David D. Cox, M. Hasegawa-Johnson · 16 Jun 2021

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures
Ivan Chelombiev, Daniel Justus, Douglas Orr, A. Dietrich, Frithjof Gressmann, A. Koliousis, Carlo Luschi · 10 Jun 2021

Do Transformers Really Perform Bad for Graph Representation?
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu · GNN · 09 Jun 2021

A Survey of Transformers
Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu · ViT · 08 Jun 2021

Luna: Linear Unified Nested Attention
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, Luke Zettlemoyer · 03 Jun 2021

Towards Deeper Deep Reinforcement Learning with Spectral Normalization
Johan Bjorck, Carla P. Gomes, Kilian Q. Weinberger · 02 Jun 2021

Choose a Transformer: Fourier or Galerkin
Shuhao Cao · 31 May 2021

StyTr$^2$: Image Style Transfer with Transformers
Yingying Deng, Fan Tang, Weiming Dong, Chongyang Ma, Xingjia Pan, Lei Wang, Changsheng Xu · ViT · 30 May 2021

Fast Nearest Neighbor Machine Translation
Yuxian Meng, Xiaoya Li, Xiayu Zheng, Fei Wu, Xiaofei Sun, Tianwei Zhang, Jiwei Li · LRM · 30 May 2021

Learning to Extend Program Graphs to Work-in-Progress Code
Xuechen Li, Chris J. Maddison, Daniel Tarlow · 28 May 2021

CogView: Mastering Text-to-Image Generation via Transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, ..., Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang · ViT, VLM · 26 May 2021

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou · AI4CE · 15 May 2021

BERT Busters: Outlier Dimensions that Disrupt Transformers
Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, Anna Rumshisky · 14 May 2021

How could Neural Networks understand Programs?
Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu · NAI · 10 May 2021

PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi-Lun Liao, ..., Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, Yonghong Tian · ALM, MoE, AI4CE · 26 Apr 2021

Temporal Query Networks for Fine-grained Video Understanding
Chuhan Zhang, Ankush Gupta, Andrew Zisserman · 19 Apr 2021

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
Conglong Li, A. A. Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He · 13 Apr 2021

Lessons on Parameter Sharing across Layers in Transformers
Sho Takase, Shun Kiyono · 13 Apr 2021

BASE Layers: Simplifying Training of Large, Sparse Models
M. Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer · MoE · 30 Mar 2021

A Practical Survey on Faster and Lighter Transformers
Quentin Fournier, G. Caron, Daniel Aloise · 26 Mar 2021

Generative Chemical Transformer: Neural Machine Learning of Molecular Geometric Structures from Chemical Language via Attention
Hyunseung Kim, Jonggeol Na, Won Bo Lee · 27 Feb 2021

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
Tao Lei · RALM, VLM · 24 Feb 2021

Do Transformer Modifications Transfer Across Implementations and Applications?
Sharan Narang, Hyung Won Chung, Yi Tay, W. Fedus, Thibault Févry, ..., Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel · 23 Feb 2021

GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training
Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, Yifan Jiang, Tom Goldstein · ODL · 16 Feb 2021

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers
Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr · 05 Feb 2021

An Efficient Transformer Decoder with Compressed Sub-layers
Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu · 03 Jan 2021

RealFormer: Transformer Likes Residual Attention
Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie · 21 Dec 2020

Sub-Linear Memory: How to Make Performers SLiM
Valerii Likhosherstov, K. Choromanski, Jared Davis, Xingyou Song, Adrian Weller · 21 Dec 2020

Learning from Mistakes: Using Mis-predictions as Harm Alerts in Language Pre-Training
Chen Xing, Wenhao Liu, Caiming Xiong · 16 Dec 2020

Multi-Interactive Attention Network for Fine-grained Feature Learning in CTR Prediction
Kai Zhang, Hao Qian, Daixin Wang, Qi Liu, Longfei Li, Jun Zhou, Jianhui Ma, Enhong Chen · HAI · 13 Dec 2020

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Minjia Zhang, Yuxiong He · AI4CE · 26 Oct 2020

Stabilizing Transformer-Based Action Sequence Generation For Q-Learning
Gideon Stein, Andrey Filchenkov, Arip Asadulaev · OffRL · 23 Oct 2020

A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks
Yun Tang, J. Pino, Changhan Wang, Xutai Ma, Dmitriy Genzel · 21 Oct 2020

Is Batch Norm unique? An empirical investigation and prescription to emulate the best properties of common normalizers without batch dependence
Vinay Rao, Jascha Narain Sohl-Dickstein · BDL · 21 Oct 2020

Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah A. Smith · AI4CE · 19 Oct 2020

Query-Key Normalization for Transformers
Alex Henry, Prudhvi Raj Dachapally, S. Pawar, Yuxuan Chen · 08 Oct 2020

Guiding Attention for Self-Supervised Learning with Transformers
Ameet Deshpande, Karthik Narasimhan · 06 Oct 2020

Group Equivariant Stand-Alone Self-Attention For Vision
David W. Romero, Jean-Baptiste Cordonnier · MDE · 02 Oct 2020

Multi-hop Attention Graph Neural Network
Guangtao Wang, Rex Ying, Jing Huang, J. Leskovec · 29 Sep 2020

Normalization Techniques in Training DNNs: Methodology, Analysis and Application
Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, Ling Shao · AI4CE · 27 Sep 2020

GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training
Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-Yan Liu, Liwei Wang · GNN · 07 Sep 2020

AutoTrans: Automating Transformer Design via Reinforced Architecture Search
Wei-wei Zhu, Xiaoling Wang, Xipeng Qiu, Yuan Ni, Guotong Xie · 04 Sep 2020

MEANTIME: Mixture of Attention Mechanisms with Multi-temporal Embeddings for Sequential Recommendation
S. Cho, Eunhyeok Park, S. Yoo · AI4TS · 19 Aug 2020

Learning Interpretable Representation for Controllable Polyphonic Music Generation
Ziyu Wang, Dingsu Wang, Yixiao Zhang, Gus Xia · DRL · 17 Aug 2020