On Layer Normalization in the Transformer Architecture

12 February 2020

Papers citing "On Layer Normalization in the Transformer Architecture"

50 / 566 papers shown

Title
Improving Autoregressive NLP Tasks via Modular Linearized Attention Victor Agostinelli Lizhong Chen 27 1 0 17 Apr 2023
M2T: Masking Transformers Twice for Faster Decoding Fabian Mentzer E. Agustsson Michael Tschannen 23 17 0 14 Apr 2023
Convex Dual Theory Analysis of Two-Layer Convolutional Neural Networks with Soft-Thresholding Chunyan Xiong Meng Lu Xiaotong Yu JIAN-PENG Cao Zhong Chen D. Guo X. Qu MLT 43 0 0 14 Apr 2023
Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers Awni Altabaa Taylor Webb Jonathan D. Cohen John Lafferty 30 8 0 01 Apr 2023
Scalable, Detailed and Mask-Free Universal Photometric Stereo Satoshi Ikehata 33 31 0 28 Mar 2023
Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation Clinton Mo Kun Hu Chengjiang Long Zhiyong Wang 35 12 0 27 Mar 2023
Robotic Packaging Optimization with Reinforcement Learning E. Drijver Rodrigo Pérez-Dattari Jens Kober Cosimo Della Santina Zlatan Ajanović OffRL 23 1 0 26 Mar 2023
It is all Connected: A New Graph Formulation for Spatio-Temporal Forecasting Lars Odegaard Bentsen N. Warakagoda R. Stenbro P. Engelstad AI4TS 15 1 0 23 Mar 2023
Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning Zaid Khan Yun Fu VLM 41 12 0 21 Mar 2023
Difficulty in chirality recognition for Transformer architectures learning chemical structures from string Yasuhiro Yoshikai T. Mizuno Shumpei Nemoto Hiroyuki Kusuhara 22 16 0 21 Mar 2023
Blind Estimation of Audio Processing Graph Sungho Lee Jaehyung Park Seungryeol Paik Kyogu Lee 25 9 0 15 Mar 2023
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale Fan Bao Shen Nie Kaiwen Xue Chongxuan Li Shiliang Pu Yaole Wang Gang Yue Yue Cao Hang Su Jun Zhu DiffM 207 151 0 12 Mar 2023
Transcription free filler word detection with Neural semi-CRFs Ge Zhu Yujia Yan Juan-Pablo Caceres Z. Duan 32 3 0 11 Mar 2023
TSMixer: An All-MLP Architecture for Time Series Forecasting Si-An Chen Chun-Liang Li Nate Yoder Sercan Ö. Arik Tomas Pfister AI4TS 36 157 0 10 Mar 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding Yuchen Li Yuan-Fang Li Andrej Risteski 120 61 0 07 Mar 2023
TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction Zhejun Zhang Alexander Liniger Dengxin Dai Feng Yu Luc Van Gool 82 42 0 07 Mar 2023
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation Bobby He James Martens Guodong Zhang Aleksandar Botev Andy Brock Samuel L. Smith Yee Whye Teh 27 30 0 20 Feb 2023
Scaling Laws for Multilingual Neural Machine Translation Patrick Fernandes Behrooz Ghorbani Xavier Garcia Markus Freitag Orhan Firat 49 29 0 19 Feb 2023
Eagle: Large-Scale Learning of Turbulent Fluid Dynamics with Mesh Transformers Steeven Janny Aurélien Béneteau Madiha Nadri Wolf Julie Digne Nicolas Thome Christian Wolf AI4CE 84 32 0 16 Feb 2023
Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution Zhengyu Liang Yingqian Wang Longguang Wang Jungang Yang Shilin Zhou Y. Guo 42 38 0 16 Feb 2023
Spatial Functa: Scaling Functa to ImageNet Classification and Generation Matthias Bauer Emilien Dupont Andy Brock Dan Rosenbaum Jonathan Richard Schwarz Hyunjik Kim DiffM 36 35 0 06 Feb 2023
V1T: large-scale mouse V1 response prediction using a Vision Transformer Bryan M. Li I. M. Cornacchia Nathalie L Rochefort A. Onken 26 8 0 06 Feb 2023
Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction Christopher Fifty Joseph M. Paggi Ehsan Amid J. Leskovec R. Dror AI4CE 25 0 0 04 Feb 2023
Dual PatchNorm Manoj Kumar Mostafa Dehghani N. Houlsby UQCV ViT 29 11 0 02 Feb 2023
STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition Yucheng Lu Shivani Agrawal Suvinay Subramanian Oleg Rybakov Chris De Sa Amir Yazdanbakhsh 21 16 0 02 Feb 2023
Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps Goro Kobayashi Tatsuki Kuribayashi Sho Yokoi Kentaro Inui 36 14 0 01 Feb 2023
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases Xiaoxia Wu Cheng-rong Li Reza Yazdani Aminabadi Z. Yao Yuxiong He MQ 19 19 0 27 Jan 2023
Deep Quantum Error Correction Yoni Choukroun Lior Wolf 27 8 0 27 Jan 2023
Modelling Long Range Dependencies in $N$D: From Task-Specific to a General Purpose CNN David M. Knigge David W. Romero Albert Gu E. Gavves Erik J. Bekkers Jakub M. Tomczak Mark Hoogendoorn J. Sonke 3DV 35 21 0 25 Jan 2023
Image Super-Resolution using Efficient Striped Window Transformer Jinpeng Shi Hui Li Tian Yu Liu Yulong Liu Hao Fei Jinchen Zhu Ling Zheng Shizhuang Weng 42 10 0 24 Jan 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale Floris Weers Vaishaal Shankar Angelos Katharopoulos Yinfei Yang Tom Gunter CLIP 23 4 0 19 Jan 2023
SPTS v2: Single-Point Scene Text Spotting Yuliang Liu Jiaxin Zhang Dezhi Peng Mingxin Huang Xinyu Wang ... Can Huang Dahua Lin Chunhua Shen Xiang Bai Lianwen Jin VLM 34 50 0 04 Jan 2023
Edge Enhanced Image Style Transfer via Transformers Chi Zhang Jun Yang Zaiyan Dai Peng-Xia Cao 16 10 0 02 Jan 2023
Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification Ziyi Tang Ruimao Zhang Zhanglin Peng Jinrui Chen Liang Lin 33 18 0 02 Jan 2023
On Transforming Reinforcement Learning by Transformer: The Development Trajectory Shengchao Hu Li Shen Ya Zhang Yixin Chen Dacheng Tao OffRL 30 25 0 29 Dec 2022
Cramming: Training a Language Model on a Single GPU in One Day Jonas Geiping Tom Goldstein MoE 32 86 0 28 Dec 2022
On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective Ying Wen Bo Liu M. Zhou Shufang Hou Zhe Cao Chenyang Le Jingxiao Chen Zheng Tian Weinan Zhang Jun Wang AI4CE 26 10 0 24 Dec 2022
Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation Wenjie Hao Hongfei Xu Lingling Mu Hongying Zan MoE 38 4 0 24 Dec 2022
Generative Colorization of Structured Mobile Web Pages Kotaro Kikuchi Naoto Inoue Mayu Otani E. Simo-Serra Kota Yamaguchi 10 9 0 22 Dec 2022
What Makes for Good Tokenizers in Vision Transformer? Shengju Qian Yi Zhu Wenbo Li Mu Li Jiaya Jia ViT 37 14 0 21 Dec 2022
SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations Ioannis Tsiamas José A. R. Fonollosa Marta R. Costa-jussá 46 6 0 19 Dec 2022
Latent Diffusion for Language Generation Justin Lovelace Varsha Kishore Chao-gang Wan Eliot Shekhtman Kilian Q. Weinberger DiffM 29 71 0 19 Dec 2022
Inductive Attention for Video Action Anticipation Tsung-Ming Tai G. Fiameni Cheng-Kuang Lee Simon See Oswald Lanz 39 1 0 17 Dec 2022
Efficient Long Sequence Modeling via State Space Augmented Transformer Simiao Zuo Xiaodong Liu Jian Jiao Denis Xavier Charles Eren Manavoglu Tuo Zhao Jianfeng Gao 130 36 0 15 Dec 2022
Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation Maha Elbayad Anna Y. Sun Shruti Bhosale MoE 59 9 0 15 Dec 2022
Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data Matthias Zeller Jens Behley Michael Heidingsfeld C. Stachniss 37 24 0 07 Dec 2022
Improve Bilingual TTS Using Dynamic Language and Phonology Embedding Fengyu Yang Jian Luan Yujun Wang 21 1 0 07 Dec 2022
Cross-lingual Similarity of Multilingual Representations Revisited Maksym Del Mark Fishel 31 3 0 04 Dec 2022
Simplifying and Understanding State Space Models with Diagonal Linear RNNs Ankit Gupta Harsh Mehta Jonathan Berant 29 21 0 01 Dec 2022
Continuous diffusion for categorical data Sander Dieleman Laurent Sartran Arman Roshannai Nikolay Savinov Yaroslav Ganin ... Conor Durkan Curtis Hawthorne Rémi Leblond Will Grathwohl J. Adler DiffM 32 100 0 28 Nov 2022