On the Role of Attention Masks and LayerNorm in Transformers

Xinyi Wu, A. Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
arXiv:2405.18781 · 29 May 2024
Papers citing "On the Role of Attention Masks and LayerNorm in Transformers" (7 papers)
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Akiyoshi Tomihari, Issei Sato · ODL · 31 Jan 2025
The Geometry of Tokens in Internal Representations of Large Language Models
Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti · AIFin · 17 Jan 2025
Lambda-Skip Connections: the architectural component that prevents Rank Collapse
Federico Arangath Joseph, Jerome Sieber, M. Zeilinger, Carmen Amo Alonso · 14 Oct 2024
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
N. Jha, Brandon Reagen · OffRL, AI4CE · 12 Oct 2024
On the Expressivity Role of LayerNorm in Transformers' Attention
Shaked Brody, Shiyu Jin, Xinghao Zhu · MoE · 04 May 2023
Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed · VLM · 28 Jul 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy, M. Saffar, Ashish Vaswani, David Grangier · MoE · 12 Mar 2020