Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

5 March 2021
Yihe Dong
Jean-Baptiste Cordonnier
Andreas Loukas
arXiv: 2103.03404

Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth"

Showing 50 of 238 citing papers.
Representation Deficiency in Masked Language Modeling
Yu Meng
Jitin Krishnan
Sinong Wang
Qifan Wang
Yuning Mao
Han Fang
Marjan Ghazvininejad
Jiawei Han
Luke Zettlemoyer
87
7
0
04 Feb 2023
When Layers Play the Lottery, all Tickets Win at Initialization
Artur Jordão
George Correa de Araujo
H. Maia
Hélio Pedrini
13
3
0
25 Jan 2023
A Close Look at Spatial Modeling: From Attention to Convolution
Xu Ma
Huan Wang
Can Qin
Kunpeng Li
Xing Zhao
Jie Fu
Yun Fu
ViT
3DPC
25
11
0
23 Dec 2022
EIT: Enhanced Interactive Transformer
Tong Zheng
Bei Li
Huiwen Bao
Tong Xiao
Jingbo Zhu
32
2
0
20 Dec 2022
Non-equispaced Fourier Neural Solvers for PDEs
Haitao Lin
Lirong Wu
Yongjie Xu
Yufei Huang
Siyuan Li
Guojiang Zhao
Stan Z. Li
22
7
0
09 Dec 2022
A K-variate Time Series Is Worth K Words: Evolution of the Vanilla Transformer Architecture for Long-term Multivariate Time Series Forecasting
Zanwei Zhou
Rui-Ming Zhong
Chen Yang
Yan Wang
Xiaokang Yang
Wei Shen
AI4TS
45
9
0
06 Dec 2022
Spatial-Spectral Transformer for Hyperspectral Image Denoising
Miaoyu Li
Ying Fu
Yulun Zhang
26
68
0
25 Nov 2022
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Sifan Long
Z. Zhao
Jimin Pi
Sheng-sheng Wang
Jingdong Wang
22
29
0
21 Nov 2022
Convexifying Transformers: Improving optimization and understanding of transformer networks
Tolga Ergen
Behnam Neyshabur
Harsh Mehta
MLT
44
15
0
20 Nov 2022
Finding Skill Neurons in Pre-trained Transformer-based Language Models
Xiaozhi Wang
Kaiyue Wen
Zhengyan Zhang
Lei Hou
Zhiyuan Liu
Juanzi Li
MILM
MoE
27
50
0
14 Nov 2022
AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Tao Yang
Jinghao Deng
Xiaojun Quan
Qifan Wang
Shaoliang Nie
32
3
0
12 Oct 2022
SML:Enhance the Network Smoothness with Skip Meta Logit for CTR Prediction
Wenlong Deng
Lang Lang
Ziqiang Liu
B. Liu
26
0
0
09 Oct 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
250
460
0
24 Sep 2022
On The Computational Complexity of Self-Attention
Feyza Duman Keles
Pruthuvi Maheshakya Wijewardena
C. Hegde
70
108
0
11 Sep 2022
Pre-Training a Graph Recurrent Network for Language Representation
Yile Wang
Linyi Yang
Zhiyang Teng
M. Zhou
Yue Zhang
GNN
38
1
0
08 Sep 2022
Addressing Token Uniformity in Transformers via Singular Value Transformation
Hanqi Yan
Lin Gui
Wenjie Li
Yulan He
26
14
0
24 Aug 2022
Exploring Generative Neural Temporal Point Process
Haitao Lin
Lirong Wu
Guojiang Zhao
Pai Liu
Stan Z. Li
DiffM
15
25
0
03 Aug 2022
Neural Knowledge Bank for Pretrained Transformers
Damai Dai
Wen-Jie Jiang
Qingxiu Dong
Yajuan Lyu
Qiaoqiao She
Zhifang Sui
KELM
26
21
0
31 Jul 2022
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Jiangning Zhang
Xiangtai Li
Yabiao Wang
Chengjie Wang
Yibo Yang
Yong Liu
Dacheng Tao
ViT
34
32
0
19 Jun 2022
Rank Diminishing in Deep Neural Networks
Ruili Feng
Kecheng Zheng
Yukun Huang
Deli Zhao
Michael I. Jordan
Zhengjun Zha
31
28
0
13 Jun 2022
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Lorenzo Noci
Sotiris Anagnostidis
Luca Biggio
Antonio Orvieto
Sidak Pal Singh
Aurelien Lucchi
61
65
0
07 Jun 2022
Vision GNN: An Image is Worth Graph of Nodes
Kai Han
Yunhe Wang
Jianyuan Guo
Yehui Tang
Enhua Wu
GNN
3DH
15
352
0
01 Jun 2022
Universal Deep GNNs: Rethinking Residual Connection in GNNs from a Path Decomposition Perspective for Preventing the Over-smoothing
Jie Chen
Weiqi Liu
Zhizhong Huang
Junbin Gao
Junping Zhang
Jian Pu
26
3
0
30 May 2022
Learning Locality and Isotropy in Dialogue Modeling
Han Wu
Hao Hao Tan
Mingjie Zhan
Gangming Zhao
Shaoqing Lu
Ding Liang
Linqi Song
39
2
0
29 May 2022
AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition
Shoufa Chen
Chongjian Ge
Zhan Tong
Jiangliu Wang
Yibing Song
Jue Wang
Ping Luo
149
638
0
26 May 2022
Your Transformer May Not be as Powerful as You Expect
Shengjie Luo
Shanda Li
Shuxin Zheng
Tie-Yan Liu
Liwei Wang
Di He
63
51
0
26 May 2022
On Bridging the Gap between Mean Field and Finite Width in Deep Random Neural Networks with Batch Normalization
Amir Joudaki
Hadi Daneshmand
Francis R. Bach
AI4CE
19
2
0
25 May 2022
Outliers Dimensions that Disrupt Transformers Are Driven by Frequency
Giovanni Puccetti
Anna Rogers
Aleksandr Drozd
F. Dell’Orletta
76
42
0
23 May 2022
A Study on Transformer Configuration and Training Objective
Fuzhao Xue
Jianghai Chen
Aixin Sun
Xiaozhe Ren
Zangwei Zheng
Xiaoxin He
Yongming Chen
Xin Jiang
Yang You
33
7
0
21 May 2022
Exploring Extreme Parameter Compression for Pre-trained Language Models
Yuxin Ren
Benyou Wang
Lifeng Shang
Xin Jiang
Qun Liu
28
18
0
20 May 2022
Causal Transformer for Estimating Counterfactual Outcomes
Valentyn Melnychuk
Dennis Frauen
Stefan Feuerriegel
CML
33
91
0
14 Apr 2022
Exploiting Temporal Relations on Radar Perception for Autonomous Driving
Peizhao Li
Puzuo Wang
K. Berntorp
Hongfu Liu
21
43
0
03 Apr 2022
Training-free Transformer Architecture Search
Qinqin Zhou
Kekai Sheng
Xiawu Zheng
Ke Li
Xing Sun
Yonghong Tian
Jie Chen
Rongrong Ji
ViT
32
46
0
23 Mar 2022
Unified Visual Transformer Compression
Shixing Yu
Tianlong Chen
Jiayi Shen
Huan Yuan
Jianchao Tan
Sen Yang
Ji Liu
Zhangyang Wang
ViT
22
92
0
15 Mar 2022
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs
Xiaohan Ding
Xinming Zhang
Yi Zhou
Jungong Han
Guiguang Ding
Jian Sun
VLM
49
528
0
13 Mar 2022
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Tianlong Chen
Zhenyu (Allen) Zhang
Yu Cheng
Ahmed Hassan Awadallah
Zhangyang Wang
ViT
41
37
0
12 Mar 2022
Block-Recurrent Transformers
DeLesley S. Hutchins
Imanol Schlag
Yuhuai Wu
Ethan Dyer
Behnam Neyshabur
23
94
0
11 Mar 2022
Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
Peihao Wang
Wenqing Zheng
Tianlong Chen
Zhangyang Wang
ViT
24
127
0
09 Mar 2022
Sky Computing: Accelerating Geo-distributed Computing in Federated Learning
Jie Zhu
Shenggui Li
Yang You
FedML
16
5
0
24 Feb 2022
Revisiting Over-smoothing in BERT from the Perspective of Graph
Han Shi
Jiahui Gao
Hang Xu
Xiaodan Liang
Zhenguo Li
Lingpeng Kong
Stephen M. S. Lee
James T. Kwok
22
71
0
17 Feb 2022
The Quarks of Attention
Pierre Baldi
Roman Vershynin
GNN
16
9
0
15 Feb 2022
On the Origins of the Block Structure Phenomenon in Neural Network Representations
Thao Nguyen
M. Raghu
Simon Kornblith
25
14
0
15 Feb 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
22
103
0
16 Jan 2022
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
Hanqing Zhang
Haolin Song
Shaoyu Li
Ming Zhou
Dawei Song
49
214
0
14 Jan 2022
Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence
Wenchi Ma
Tianxiao Zhang
Guanghui Wang
ViT
36
14
0
26 Dec 2021
Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences
Yifan Chen
Qi Zeng
Dilek Z. Hakkani-Tür
Di Jin
Heng Ji
Yun Yang
25
4
0
10 Dec 2021
Dynamic Graph Learning-Neural Network for Multivariate Time Series Modeling
Zhuoling Li
Gaowei Zhang
Lingyu Xu
Jie Yu
AI4TS
16
2
0
06 Dec 2021
Graph Conditioned Sparse-Attention for Improved Source Code Understanding
Junyan Cheng
Iordanis Fostiropoulos
Barry W. Boehm
19
1
0
01 Dec 2021
Pruning Self-attentions into Convolutional Layers in Single Path
Haoyu He
Jianfei Cai
Jing Liu
Zizheng Pan
Jing Zhang
Dacheng Tao
Bohan Zhuang
ViT
34
40
0
23 Nov 2021
MetaFormer Is Actually What You Need for Vision
Weihao Yu
Mi Luo
Pan Zhou
Chenyang Si
Yichen Zhou
Xinchao Wang
Jiashi Feng
Shuicheng Yan
31
874
0
22 Nov 2021