Lambda-Skip Connections: the architectural component that prevents Rank Collapse
Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso · 14 October 2024 · arXiv:2410.10609
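The title names the mechanism the paper studies: a skip (residual) connection whose strength is set by a scalar λ, with the claim that keeping this scaling in the architecture stops the token representations of deep sequence models from collapsing in rank. As a rough illustration of that idea only, here is a minimal PyTorch sketch of a sublayer wrapped in a λ-scaled skip connection; the module name LambdaSkip, the learnable-scalar parameterization, and the attention wrapper are illustrative assumptions, not the paper's reference implementation.

    import torch
    import torch.nn as nn

    class LambdaSkip(nn.Module):
        """Scaled skip connection: y = lam * x + f(x).  (Illustrative sketch.)

        lam = 1 recovers the ordinary residual connection; the lambda-skip
        idea is that this scaling on the skip branch keeps deep stacks of
        attention / SSM blocks from driving token representations toward
        a rank-deficient subspace.
        """
        def __init__(self, sublayer: nn.Module, init_lambda: float = 1.0):
            super().__init__()
            self.sublayer = sublayer
            # Learnable scalar on the skip branch (assumed parameterization).
            self.lam = nn.Parameter(torch.tensor(init_lambda))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.lam * x + self.sublayer(x)

    # Illustrative usage: wrap a self-attention sublayer.
    class SelfAttention(nn.Module):
        def __init__(self, dim: int, heads: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, _ = self.attn(x, x, x)  # discard attention weights
            return out

    block = LambdaSkip(SelfAttention(dim=64, heads=4))
    x = torch.randn(2, 16, 64)   # (batch, sequence length, embedding dim)
    y = block(x)                 # same shape; skip branch scaled by lam

Setting init_lambda to 1.0 recovers the standard residual connection y = x + f(x); the sketch only shows where such a scaling sits in a block, not the paper's analysis of which values of λ avoid rank collapse.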
Papers citing "Lambda-Skip Connections: the architectural component that prevents Rank Collapse" (39 of 39 shown; each entry gives authors, publication date, and citation count)
State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era
  Matteo Tiezzi, Michele Casoni, Alessandro Betti, Marco Gori, Stefano Melacci · 13 Jun 2024 · 3 citations
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Tri Dao, Albert Gu · 31 May 2024 · 502 citations
On the Role of Attention Masks and LayerNorm in Transformers
  Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie · 29 May 2024 · 12 citations
Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks
  Jerome Sieber, Carmen Amo Alonso, Alexandre Didier, Melanie Zeilinger, Antonio Orvieto · 24 May 2024 · 8 citations
The Hidden Attention of Mamba Models
  Ameen Ali, Itamar Zimerman, Lior Wolf · 03 Mar 2024 · 63 citations
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Albert Gu, Tri Dao · 01 Dec 2023 · 2,670 citations
Demystifying Oversmoothing in Attention-Based Graph Neural Networks
  Xinyi Wu, Amir Ajorlou, Zihui Wu, Ali Jadbabaie · 25 May 2023 · 43 citations
On the Expressivity Role of LayerNorm in Transformers' Attention
  Shaked Brody, Uri Alon, Eran Yahav · 04 May 2023 · 32 citations
Resurrecting Recurrent Neural Networks for Long Sequences
  Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Çağlar Gülçehre, Razvan Pascanu, Soham De · 11 Mar 2023 · 287 citations
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
  Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh · 20 Feb 2023 · 30 citations
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
  Hongyu Hè, Marko Kabić · 13 Feb 2023 · 2 citations
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
  Daniel Y. Fu, Tri Dao, Khaled Kamal Saab, Armin W. Thomas, Atri Rudra, Christopher Ré · 28 Dec 2022 · 396 citations
On the Parameterization and Initialization of Diagonal State Space Models
  Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré · 23 Jun 2022 · 315 citations
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
  Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurélien Lucchi · 07 Jun 2022 · 72 citations
Revisiting Over-smoothing in BERT from the Perspective of Graph
  Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James T. Kwok · 17 Feb 2022 · 74 citations
Efficiently Modeling Long Sequences with Structured State Spaces
  Albert Gu, Karan Goel, Christopher Ré · 31 Oct 2021 · 1,773 citations
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers
  Albert Gu, Isys Johnson, Karan Goel, Khaled Kamal Saab, Tri Dao, Atri Rudra, Christopher Ré · 26 Oct 2021 · 595 citations
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
  Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas · 05 Mar 2021 · 385 citations
Long Range Arena: A Benchmark for Efficient Transformers
  Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler · 08 Nov 2020 · 718 citations
HiPPO: Recurrent Memory with Optimal Polynomial Projections
  Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Ré · 17 Aug 2020 · 517 citations
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
  Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret · 29 Jun 2020 · 1,765 citations
Language Models are Few-Shot Learners
  Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei · 28 May 2020 · 41,932 citations
ReZero is All You Need: Fast Convergence at Large Depth
  Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley · 10 Mar 2020 · 281 citations
On Layer Normalization in the Transformer Architecture
  Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu · 12 Feb 2020 · 989 citations
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut · 26 Sep 2019 · 6,449 citations
Graph Neural Networks: A Review of Methods and Applications
  Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun · 20 Dec 2018 · 5,515 citations
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · 11 Oct 2018 · 94,770 citations
Deep Learning using Rectified Linear Units (ReLU)
  Abien Fred Agarap · 22 Mar 2018 · 3,223 citations
Visualizing the Loss Landscape of Neural Nets
  Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein · 28 Dec 2017 · 1,890 citations
Attention Is All You Need
  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin · 12 Jun 2017 · 131,526 citations
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
  David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams · 28 Feb 2017 · 404 citations
An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
  Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu · 18 Nov 2016 · 987 citations
Layer Normalization
  Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton · 21 Jul 2016 · 10,482 citations
Identity Mappings in Deep Residual Networks
  Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · 16 Mar 2016 · 10,182 citations
Deep Residual Learning for Image Recognition
  Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · 10 Dec 2015 · 193,878 citations
Effective Approaches to Attention-based Neural Machine Translation
  Thang Luong, Hieu Pham, Christopher D. Manning · 17 Aug 2015 · 7,962 citations
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  Sergey Ioffe, Christian Szegedy · 11 Feb 2015 · 43,289 citations
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
  Junyoung Chung, Çağlar Gülçehre, Kyunghyun Cho, Yoshua Bengio · 11 Dec 2014 · 12,704 citations
Deep Networks with Internal Selective Attention through Feedback Connections
  Marijn F. Stollenga, Jonathan Masci, Faustino J. Gomez, Jürgen Schmidhuber · 11 Jul 2014 · 257 citations