Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

30 August 2019
Yao-Hung Hubert Tsai, Shaojie Bai, M. Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov
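For context on what the citing works below build on: the paper's central reformulation reads attention as kernel smoothing over the key set, with the kernel as a pluggable design choice. A minimal NumPy sketch of that reading (the function names and the softmax-kernel instantiation are ours for illustration, not code from the paper):

```python
import numpy as np

def softmax_kernel(Q, K):
    # Exponential kernel on scaled dot products; with this choice the
    # smoother below reduces to standard softmax attention. Other kernels
    # recover the attention variants the paper compares.
    d = Q.shape[-1]
    return np.exp(Q @ K.T / np.sqrt(d))

def kernel_attention(Q, K, V, kernel=softmax_kernel):
    # Attention read as Nadaraya-Watson kernel smoothing over the keys:
    #   out_i = sum_j k(q_i, k_j) * v_j / sum_j k(q_i, k_j)
    W = kernel(Q, K)                          # (n_q, n_k) kernel weights
    return (W / W.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = kernel_attention(Q, K, V)               # shape (4, 8)
```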

Papers citing "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel"

8 of 58 citing papers shown.

Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber
22 Feb 2021

LieTransformer: Equivariant self-attention for Lie Groups
M. Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, Hyunjik Kim
20 Dec 2020

Rethinking Attention with Performers
K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, ..., Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller
30 Sep 2020

On the Computational Power of Transformers and its Implications in Sequence Modeling
S. Bhattamishra, Arkil Patel, Navin Goyal
16 Jun 2020

The Lipschitz Constant of Self-Attention
Hyunjik Kim, George Papamakarios, A. Mnih
08 Jun 2020

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, ..., Peter Hawkins, Jared Davis, David Belanger, Lucy J. Colwell, Adrian Weller
05 Jun 2020

Kernel Self-Attention in Deep Multiple Instance Learning [SSL]
Dawid Rymarczyk, Adriana Borowa, Jacek Tabor, Bartosz Zieliński
25 May 2020

Classical Structured Prediction Losses for Sequence to Sequence Learning [AIMat]
Sergey Edunov, Myle Ott, Michael Auli, David Grangier, Marc'Aurelio Ranzato
14 Nov 2017