Lambda-Skip Connections: the architectural component that prevents Rank Collapse
Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso · 14 October 2024 · arXiv:2410.10609
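The title names the mechanism the paper studies: a skip (residual) connection whose strength is set by a scalar λ, with the claim that keeping this scaling in the architecture stops the token representations of deep sequence models from collapsing in rank. As a rough illustration of that idea only, here is a minimal PyTorch sketch of a sublayer wrapped in a λ-scaled skip connection; the module name LambdaSkip, the learnable-scalar parameterization, and the attention wrapper are illustrative assumptions, not the paper's reference implementation.

    import torch
    import torch.nn as nn

    class LambdaSkip(nn.Module):
        """Scaled skip connection: y = lam * x + f(x).  (Illustrative sketch.)

        lam = 1 recovers the ordinary residual connection; the lambda-skip
        idea is that this scaling on the skip branch keeps deep stacks of
        attention / SSM blocks from driving token representations toward
        a rank-deficient subspace.
        """
        def __init__(self, sublayer: nn.Module, init_lambda: float = 1.0):
            super().__init__()
            self.sublayer = sublayer
            # Learnable scalar on the skip branch (assumed parameterization).
            self.lam = nn.Parameter(torch.tensor(init_lambda))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.lam * x + self.sublayer(x)

    # Illustrative usage: wrap a self-attention sublayer.
    class SelfAttention(nn.Module):
        def __init__(self, dim: int, heads: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, _ = self.attn(x, x, x)  # discard attention weights
            return out

    block = LambdaSkip(SelfAttention(dim=64, heads=4))
    x = torch.randn(2, 16, 64)   # (batch, sequence length, embedding dim)
    y = block(x)                 # same shape; skip branch scaled by lam

Setting init_lambda to 1.0 recovers the standard residual connection y = x + f(x); the sketch only shows where such a scaling sits in a block, not the paper's analysis of which values of λ avoid rank collapse.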
Papers citing "Lambda-Skip Connections: the architectural component that prevents Rank Collapse" (39 of 39 shown; each entry gives authors, publication date, and citation count)
State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era
  Matteo Tiezzi, Michele Casoni, Alessandro Betti, Marco Gori, Stefano Melacci · 13 Jun 2024 · 3 citations
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Tri Dao, Albert Gu · 31 May 2024 · 502 citations
On the Role of Attention Masks and LayerNorm in Transformers
  Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie · 29 May 2024 · 12 citations
Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks
  Jerome Sieber, Carmen Amo Alonso, Alexandre Didier, Melanie Zeilinger, Antonio Orvieto · 24 May 2024 · 8 citations
The Hidden Attention of Mamba Models
  Ameen Ali, Itamar Zimerman, Lior Wolf · 03 Mar 2024 · 63 citations
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Albert Gu, Tri Dao · 01 Dec 2023 · 2,670 citations
Demystifying Oversmoothing in Attention-Based Graph Neural Networks
  Xinyi Wu, Amir Ajorlou, Zihui Wu, Ali Jadbabaie · 25 May 2023 · 43 citations
On the Expressivity Role of LayerNorm in Transformers' Attention
  Shaked Brody, Uri Alon, Eran Yahav · 04 May 2023 · 32 citations
Resurrecting Recurrent Neural Networks for Long Sequences
  Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Çağlar Gülçehre, Razvan Pascanu, Soham De · 11 Mar 2023 · 287 citations
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
  Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh · 20 Feb 2023 · 30 citations
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
  Hongyu Hè, Marko Kabić · 13 Feb 2023 · 2 citations
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
  Daniel Y. Fu, Tri Dao, Khaled Kamal Saab, Armin W. Thomas, Atri Rudra, Christopher Ré · 28 Dec 2022 · 396 citations
On the Parameterization and Initialization of Diagonal State Space Models
  Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré · 23 Jun 2022 · 315 citations
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
  Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurélien Lucchi · 07 Jun 2022 · 72 citations
Revisiting Over-smoothing in BERT from the Perspective of Graph
  Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James T. Kwok · 17 Feb 2022 · 74 citations
Efficiently Modeling Long Sequences with Structured State Spaces
  Albert Gu, Karan Goel, Christopher Ré · 31 Oct 2021 · 1,773 citations
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers
  Albert Gu, Isys Johnson, Karan Goel, Khaled Kamal Saab, Tri Dao, Atri Rudra, Christopher Ré · 26 Oct 2021 · 595 citations
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
  Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas · 05 Mar 2021 · 385 citations
Long Range Arena: A Benchmark for Efficient Transformers
  Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler · 08 Nov 2020 · 718 citations
HiPPO: Recurrent Memory with Optimal Polynomial Projections
  Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Ré · 17 Aug 2020 · 517 citations
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
  Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret · 29 Jun 2020 · 1,765 citations
Language Models are Few-Shot Learners
  Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei · 28 May 2020 · 41,932 citations
ReZero is All You Need: Fast Convergence at Large Depth
  Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley · 10 Mar 2020 · 281 citations
On Layer Normalization in the Transformer Architecture
  Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu · 12 Feb 2020 · 989 citations
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut · 26 Sep 2019 · 6,449 citations
Graph Neural Networks: A Review of Methods and Applications
  Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun · 20 Dec 2018 · 5,515 citations
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · 11 Oct 2018 · 94,770 citations
Deep Learning using Rectified Linear Units (ReLU)
  Abien Fred Agarap · 22 Mar 2018 · 3,223 citations
Visualizing the Loss Landscape of Neural Nets
  Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein · 28 Dec 2017 · 1,890 citations
Attention Is All You Need
  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin · 12 Jun 2017 · 131,526 citations
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
  David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams · 28 Feb 2017 · 404 citations
An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
  Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu · 18 Nov 2016 · 987 citations
Layer Normalization
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton · 21 Jul 2016 · 10,482 citations
Identity Mappings in Deep Residual Networks
  Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · 16 Mar 2016 · 10,182 citations
Deep Residual Learning for Image Recognition
  Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · 10 Dec 2015 · 193,878 citations
Effective Approaches to Attention-based Neural Machine Translation
  Thang Luong, Hieu Pham, Christopher D. Manning · 17 Aug 2015 · 7,962 citations
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  Sergey Ioffe, Christian Szegedy · 11 Feb 2015 · 43,289 citations
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
  Junyoung Chung, Çağlar Gülçehre, Kyunghyun Cho, Yoshua Bengio · 11 Dec 2014 · 12,704 citations
Deep Networks with Internal Selective Attention through Feedback Connections
  Marijn F. Stollenga, Jonathan Masci, Faustino J. Gomez, Jürgen Schmidhuber · 11 Jul 2014 · 257 citations