Normalized Attention Without Probability Cage
Oliver Richter, Roger Wattenhofer
arXiv:2005.09561 · 19 May 2020
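For context, the title refers to the "probability cage" created by the softmax in standard attention: because softmax weights are non-negative and sum to one, each attention output is confined to the convex hull of the value vectors. The Python sketch below contrasts vanilla softmax attention with one plausible softmax-free, normalized variant. It is a minimal illustration under that reading of the title; the paper's exact formulation may differ, and `normalized_attention` here is an assumption for illustration, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention (Vaswani et al., 2017).
    The softmax weights are non-negative and sum to one, so every
    output row is a convex combination of the rows of V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def normalized_attention(Q, K, V, eps=1e-5):
    """Hypothetical softmax-free variant: standardize the scores over
    the key axis (zero mean, unit variance). Weights can now be negative
    and need not sum to one, so outputs may leave the convex hull of V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mu = scores.mean(axis=-1, keepdims=True)
    sigma = scores.std(axis=-1, keepdims=True)
    return ((scores - mu) / (sigma + eps)) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))  # 4 tokens, width 8
print(softmax_attention(Q, K, V).shape)     # (4, 8)
print(normalized_attention(Q, K, V).shape)  # (4, 8)
```

The design point is that dropping the simplex constraint changes what the layer can express; whether standardization, LayerNorm, or another normalizer replaces the softmax is a modeling choice, and the paper should be consulted for the authors' specific variant.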

Papers citing "Normalized Attention Without Probability Cage"

All 41 citing papers are listed below, newest first. Topic tags, where assigned, follow each title in brackets; each entry closes with the publication date and citation count.
 1. Encryption-Friendly LLM Architecture
    Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Jung Hee Cheon, Ernest K. Ryu · 24 Feb 2025 · 2 citations

 2. Pointer Graph Networks [GNN]
    Petar Veličković, Lars Buesing, Matthew Overlan, Razvan Pascanu, Oriol Vinyals, Charles Blundell · 11 Jun 2020 · 62 citations

 3. Lite Transformer with Long-Short Range Attention
    Zhanghao Wu, Zhijian Liu, Ji Lin, Chengyue Wu, Song Han · 24 Apr 2020 · 321 citations

 4. Understanding the Difficulty of Training Transformers [AI4CE]
    Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Jiawei Han · 17 Apr 2020 · 251 citations

 5. Telling BERT's full story: from Local Attention to Global Aggregation
    Damian Pascual, Gino Brunner, Roger Wattenhofer · 10 Apr 2020 · 19 citations

 6. ReZero is All You Need: Fast Convergence at Large Depth [AI4CE]
    Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, H. H. Mao, G. Cottrell, Julian McAuley · 10 Mar 2020 · 277 citations

 7. Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization [ODL, OOD]
    S. Chatterjee · 25 Feb 2020 · 51 citations

 8. Are Transformers universal approximators of sequence-to-sequence functions?
    Chulhee Yun, Srinadh Bhojanapalli, A. S. Rawat, Sashank J. Reddi, Sanjiv Kumar · 20 Dec 2019 · 347 citations

 9. Improving Transformer Models by Reordering their Sublayers
    Ofir Press, Noah A. Smith, Omer Levy · 10 Nov 2019 · 87 citations

10. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [AIMat]
    Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu · 23 Oct 2019 · 19,824 citations

11. Neural Execution of Graph Algorithms [GNN]
    Petar Veličković, Rex Ying, Matilde Padovano, R. Hadsell, Charles Blundell · 23 Oct 2019 · 166 citations

12. Transformers without Tears: Improving the Normalization of Self-Attention
    Toan Q. Nguyen, Julian Salazar · 14 Oct 2019 · 229 citations

13. Stabilizing Transformers for Reinforcement Learning [OffRL]
    Emilio Parisotto, H. F. Song, Jack W. Rae, Razvan Pascanu, Çağlar Gülçehre, ..., Aidan Clark, Seb Noury, M. Botvinick, N. Heess, R. Hadsell · 13 Oct 2019 · 360 citations

14. On Universal Equivariant Set Networks [3DPC]
    Nimrod Segol, Y. Lipman · 06 Oct 2019 · 63 citations

15. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations [SSL, AIMat]
    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut · 26 Sep 2019 · 6,420 citations

16. Reducing Transformer Depth on Demand with Structured Dropout
    Angela Fan, Edouard Grave, Armand Joulin · 25 Sep 2019 · 588 citations

17. Attention is not not Explanation [XAI, AAML, FAtt]
    Sarah Wiegreffe, Yuval Pinter · 13 Aug 2019 · 901 citations

18. On Identifiability in Transformers [ViT]
    Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer · 12 Aug 2019 · 188 citations

19. RoBERTa: A Robustly Optimized BERT Pretraining Approach [AIMat]
    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov · 26 Jul 2019 · 24,160 citations

20. Attentive Multi-Task Deep Reinforcement Learning [CLL]
    Timo Bram, Gino Brunner, Oliver Richter, Roger Wattenhofer · 05 Jul 2019 · 18 citations

21. XLNet: Generalized Autoregressive Pretraining for Language Understanding [AI4CE]
    Zhilin Yang, Zihang Dai, Yiming Yang, J. Carbonell, Ruslan Salakhutdinov, Quoc V. Le · 19 Jun 2019 · 8,386 citations

22. Stand-Alone Self-Attention in Vision Models [VLM, SLR, ViT]
    Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens · 13 Jun 2019 · 1,208 citations

23. What Can Neural Networks Reason About? [NAI, AI4CE]
    Keyulu Xu, Jingling Li, Mozhi Zhang, S. Du, Ken-ichi Kawarabayashi, Stefanie Jegelka · 30 May 2019 · 243 citations

24. Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints
    Mengtian Li, Ersin Yumer, Deva Ramanan · 12 May 2019 · 47 citations

25. Attention is not Explanation [FAtt]
    Sarthak Jain, Byron C. Wallace · 26 Feb 2019 · 1,307 citations

26. Pay Less Attention with Lightweight and Dynamic Convolutions
    Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli · 29 Jan 2019 · 606 citations

27. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [VLM, SSL, SSeg]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · 11 Oct 2018 · 93,936 citations

28. How Powerful are Graph Neural Networks? [GNN]
    Keyulu Xu, Weihua Hu, J. Leskovec, Stefanie Jegelka · 01 Oct 2018 · 7,554 citations

29. Universal Transformers
    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser · 10 Jul 2018 · 752 citations

30. Self-Attention Generative Adversarial Networks [GAN]
    Han Zhang, Ian Goodfellow, Dimitris N. Metaxas, Augustus Odena · 21 May 2018 · 3,710 citations

31. Group Normalization
    Yuxin Wu, Kaiming He · 22 Mar 2018 · 3,626 citations

32. Attention Is All You Need [3DV]
    Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin · 12 Jun 2017 · 129,831 citations

33. Deep Sets
    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczós, Ruslan Salakhutdinov, Alex Smola · 10 Mar 2017 · 2,441 citations

34. Instance Normalization: The Missing Ingredient for Fast Stylization [OOD]
    Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky · 27 Jul 2016 · 3,689 citations

35. Layer Normalization
    Jimmy Lei Ba, J. Kiros, Geoffrey E. Hinton · 21 Jul 2016 · 10,412 citations

36. Gaussian Error Linear Units (GELUs)
    Dan Hendrycks, Kevin Gimpel · 27 Jun 2016 · 4,958 citations

37. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks [ODL]
    Tim Salimans, Diederik P. Kingma · 25 Feb 2016 · 1,933 citations

38. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [OOD]
    Sergey Ioffe, Christian Szegedy · 11 Feb 2015 · 43,154 citations

39. Adam: A Method for Stochastic Optimization [ODL]
    Diederik P. Kingma, Jimmy Ba · 22 Dec 2014 · 149,474 citations

40. Neural Machine Translation by Jointly Learning to Align and Translate [AIMat]
    Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio · 01 Sep 2014 · 27,205 citations

41. Generating Sequences With Recurrent Neural Networks [GAN]
    Alex Graves · 04 Aug 2013 · 4,025 citations