Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

9 May 2021
Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua

Papers citing "Which transformer architecture fits my data? A vocabulary bottleneck in self-attention"

36 papers

Decision Transformer: Reinforcement Learning via Sequence Modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, A. Srinivas, Igor Mordatch
Communities: OffRL
1,608 citations · 02 Jun 2021

ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
482 citations · 28 May 2021

Scaling Laws for Autoregressive Generative Modeling
T. Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, ..., Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish
414 citations · 28 Oct 2020

mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
2,489 citations · 22 Oct 2020

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby
Communities: ViT
40,217 citations · 22 Oct 2020

PMI-Masking: Principled masking of correlated spans
Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Y. Shoham
72 citations · 05 Oct 2020

The Depth-to-Width Interplay in Self-Attention
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, Amnon Shashua
46 citations · 22 Jun 2020

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli
Communities: SSL
5,677 citations · 20 Jun 2020

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Communities: BDL
41,106 citations · 28 May 2020

Normalized Attention Without Probability Cage
Oliver Richter, Roger Wattenhofer
21 citations · 19 May 2020

Jukebox: A Generative Model for Music
Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever
Communities: VLM
731 citations · 30 Apr 2020

Low-Rank Bottleneck in Multi-head Attention Models
Srinadh Bhojanapalli, Chulhee Yun, A. S. Rawat, Sashank J. Reddi, Sanjiv Kumar
95 citations · 17 Feb 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
4,662 citations · 23 Jan 2020

Improving Transformer Models by Reordering their Sublayers
Ofir Press, Noah A. Smith, Omer Levy
87 citations · 10 Nov 2019

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Communities: AIMat
19,824 citations · 23 Oct 2019

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Communities: SSL, AIMat
6,420 citations · 26 Sep 2019

Learning to Deceive with Attention-Based Explanations
Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, Zachary Chase Lipton
193 citations · 17 Sep 2019

On Identifiability in Transformers
Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer
Communities: ViT
188 citations · 12 Aug 2019

Scaling Autoregressive Video Models
Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
Communities: DiffM, VGen
200 citations · 06 Jun 2019

Are Sixteen Heads Really Better than One?
Paul Michel, Omer Levy, Graham Neubig
Communities: MoE
1,049 citations · 25 May 2019

Generating Long Sequences with Sparse Transformers
R. Child, Scott Gray, Alec Radford, Ilya Sutskever
1,880 citations · 23 Apr 2019

Analysing Mathematical Reasoning Abilities of Neural Models
D. Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli
Communities: LRM
420 citations · 02 Apr 2019

Attention is not Explanation
Sarthak Jain, Byron C. Wallace
Communities: FAtt
1,307 citations · 26 Feb 2019

Deep autoregressive models for the efficient variational simulation of many-body quantum systems
Or Sharir, Yoav Levine, Noam Wies, Giuseppe Carleo, Amnon Shashua
188 citations · 11 Feb 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Communities: VLM, SSL, SSeg
93,936 citations · 11 Oct 2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson
3,490 citations · 19 Aug 2018

Quantum Entanglement in Deep Learning Architectures
Yoav Levine, Or Sharir, Nadav Cohen, Amnon Shashua
182 citations · 26 Mar 2018

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
Communities: BDL
367 citations · 10 Nov 2017

On the Long-Term Memory of Deep Recurrent Networks
Yoav Levine, Or Sharir, Alon Ziv, Amnon Shashua
24 citations · 25 Oct 2017

Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
Communities: 3DV
129,831 citations · 12 Jun 2017

Analysis and Design of Convolutional Networks via Hierarchical Tensor Decompositions
Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, Amnon Shashua
38 citations · 05 May 2017

Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design
Yoav Levine, David Yakira, Nadav Cohen, Amnon Shashua
126 citations · 05 Apr 2017

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
Sheng Wang, S. Sun, Zerui Li, Renyu Zhang, Jinbo Xu
818 citations · 02 Sep 2016

Using the Output Embedding to Improve Language Models
Ofir Press, Lior Wolf
731 citations · 20 Aug 2016

Inductive Bias of Deep Convolutional Networks through Pooling Geometry
Nadav Cohen, Amnon Shashua
132 citations · 22 May 2016

Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, Alexandra Birch
7,683 citations · 31 Aug 2015