Inductive Biases and Variable Creation in Self-Attention Mechanisms

19 October 2021 · arXiv:2110.10090
Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang

Papers citing "Inductive Biases and Variable Creation in Self-Attention Mechanisms"

48 of 48 citing papers shown
Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention
Arya Honarpisheh, Mustafa Bozdag, Octavia Camps, Mario Sznaier (03 Feb 2025)

Approximation Rate of the Transformer Architecture for Sequence Modeling
Hao Jiang, Qianxiao Li (03 Jan 2025)

Training Neural Networks as Recognizers of Formal Languages
Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, Brian DuSell (11 Nov 2024)

Representing Rule-based Chatbots with Transformers
Dan Friedman, Abhishek Panigrahi, Danqi Chen (15 Jul 2024)

Length independent generalization bounds for deep SSM architectures via Rademacher contraction and stability constraints
Dániel Rácz, Mihaly Petreczky, Bálint Daróczy (30 May 2024)

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah (24 Sep 2022)

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra (06 Jan 2022)

Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, ..., Olivier J. Hénaff, M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira (30 Jul 2021)

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers
Colin Wei, Yining Chen, Tengyu Ma (28 Jul 2021)

Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, ..., Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba (07 Jul 2021)

On the Expressive Power of Self-Attention Matrices
Valerii Likhosherstov, K. Choromanski, Adrian Weller (07 Jun 2021)

Decision Transformer: Reinforcement Learning via Sequence Modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, A. Srinivas, Igor Mordatch (02 Jun 2021)

FNet: Mixing Tokens with Fourier Transforms
James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon (09 May 2021)

MLP-Mixer: An all-MLP Architecture for Vision
Ilya O. Tolstikhin, N. Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, ..., Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy (04 May 2021)

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun (19 Mar 2021)

Approximating How Single Head Attention Learns
Charles Burton Snell, Ruiqi Zhong, Dan Klein, Jacob Steinhardt (13 Mar 2021)

Pretrained Transformers as Universal Computation Engines
Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch (09 Mar 2021)

Perceiver: General Perception with Iterative Attention
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira (04 Mar 2021)

Neural Production Systems: Learning Rule-Governed Visual Dynamics
Anirudh Goyal, Aniket Didolkar, Nan Rosemary Ke, Charles Blundell, Philippe Beaudoin, N. Heess, Michael C. Mozer, Yoshua Bengio (02 Mar 2021)

Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay, Mostafa Dehghani, Samira Abnar, Songlin Yang, Dara Bahri, Philip Pham, J. Rao, Liu Yang, Sebastian Ruder, Donald Metzler (08 Nov 2020)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby (22 Oct 2020)

Rethinking Attention with Performers
K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, ..., Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller (30 Sep 2020)

Tensor Programs II: Neural Tangent Kernel for Any Architecture
Greg Yang (25 Jun 2020)

Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron, Yasaman Bahri, Jascha Narain Sohl-Dickstein, Roman Novak (18 Jun 2020)

On the Computational Power of Transformers and its Implications in Sequence Modeling
S. Bhattamishra, Arkil Patel, Navin Goyal (16 Jun 2020)

The Lipschitz Constant of Self-Attention
Hyunjik Kim, George Papamakarios, A. Mnih (08 Jun 2020)

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (28 May 2020)

A Primer in BERTology: What we know about how BERT works
Anna Rogers, Olga Kovaleva, Anna Rumshisky (27 Feb 2020)

Are Transformers universal approximators of sequence-to-sequence functions?
Chulhee Yun, Srinadh Bhojanapalli, A. S. Rawat, Sashank J. Reddi, Sanjiv Kumar (20 Dec 2019)

Fantastic Generalization Measures and Where to Find Them
Yiding Jiang, Behnam Neyshabur, H. Mobahi, Dilip Krishnan, Samy Bengio (04 Dec 2019)

On Generalization Bounds of a Family of Recurrent Neural Networks
Minshuo Chen, Xingguo Li, T. Zhao (28 Oct 2019)

Recurrent Independent Mechanisms
Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, Bernhard Schölkopf (24 Sep 2019)

Theoretical Limitations of Self-Attention in Neural Sequence Models
Michael Hahn (16 Jun 2019)

What Does BERT Look At? An Analysis of BERT's Attention
Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning (11 Jun 2019)

Generalization bounds for deep convolutional neural networks
Philip M. Long, Hanie Sedghi (29 May 2019)

BERT Rediscovers the Classical NLP Pipeline
Ian Tenney, Dipanjan Das, Ellie Pavlick (15 May 2019)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (11 Oct 2018)

Generating Wikipedia by Summarizing Long Sequences
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam M. Shazeer (30 Jan 2018)

Size-Independent Sample Complexity of Neural Networks
Noah Golowich, Alexander Rakhlin, Ohad Shamir (18 Dec 2017)

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
Behnam Neyshabur, Srinadh Bhojanapalli, Nathan Srebro (29 Jul 2017)

Spectrally-normalized margin bounds for neural networks
Peter L. Bartlett, Dylan J. Foster, Matus Telgarsky (26 Jun 2017)

Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin (12 Jun 2017)

Layer Normalization
Jimmy Lei Ba, J. Kiros, Geoffrey E. Hinton (21 Jul 2016)

Effective Approaches to Attention-based Neural Machine Translation
Thang Luong, Hieu H. Pham, Christopher D. Manning (17 Aug 2015)

Norm-Based Capacity Control in Neural Networks
Behnam Neyshabur, Ryota Tomioka, Nathan Srebro (27 Feb 2015)

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Ke Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, R. Zemel, Yoshua Bengio (10 Feb 2015)

Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba (22 Dec 2014)

Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (01 Sep 2014)