Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

9 May 2021
Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua

Papers citing "Which transformer architecture fits my data? A vocabulary bottleneck in self-attention"

36 papers

Decision Transformer: Reinforcement Learning via Sequence Modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, A. Srinivas, Igor Mordatch
Communities: OffRL
1,608 citations · 02 Jun 2021

ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
482 citations · 28 May 2021

Scaling Laws for Autoregressive Generative Modeling
T. Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, ..., Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish
414 citations · 28 Oct 2020

mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
2,489 citations · 22 Oct 2020

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby
Communities: ViT
40,217 citations · 22 Oct 2020

PMI-Masking: Principled masking of correlated spans
Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Y. Shoham
72 citations · 05 Oct 2020

The Depth-to-Width Interplay in Self-Attention
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, Amnon Shashua
46 citations · 22 Jun 2020

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli
Communities: SSL
5,677 citations · 20 Jun 2020

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Communities: BDL
41,106 citations · 28 May 2020

Normalized Attention Without Probability Cage
Oliver Richter, Roger Wattenhofer
21 citations · 19 May 2020

Jukebox: A Generative Model for Music
Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever
Communities: VLM
731 citations · 30 Apr 2020

Low-Rank Bottleneck in Multi-head Attention Models
Srinadh Bhojanapalli, Chulhee Yun, A. S. Rawat, Sashank J. Reddi, Sanjiv Kumar
95 citations · 17 Feb 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
4,662 citations · 23 Jan 2020

Improving Transformer Models by Reordering their Sublayers
Ofir Press, Noah A. Smith, Omer Levy
87 citations · 10 Nov 2019

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Communities: AIMat
19,824 citations · 23 Oct 2019

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Communities: SSL, AIMat
6,420 citations · 26 Sep 2019

Learning to Deceive with Attention-Based Explanations
Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, Zachary Chase Lipton
193 citations · 17 Sep 2019

On Identifiability in Transformers
Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer
Communities: ViT
188 citations · 12 Aug 2019

Scaling Autoregressive Video Models
Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
Communities: DiffM, VGen
200 citations · 06 Jun 2019

Are Sixteen Heads Really Better than One?
Paul Michel, Omer Levy, Graham Neubig
Communities: MoE
1,049 citations · 25 May 2019

Generating Long Sequences with Sparse Transformers
R. Child, Scott Gray, Alec Radford, Ilya Sutskever
1,880 citations · 23 Apr 2019

Analysing Mathematical Reasoning Abilities of Neural Models
D. Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli
Communities: LRM
420 citations · 02 Apr 2019

Attention is not Explanation
Sarthak Jain, Byron C. Wallace
Communities: FAtt
1,307 citations · 26 Feb 2019

Deep autoregressive models for the efficient variational simulation of many-body quantum systems
Or Sharir, Yoav Levine, Noam Wies, Giuseppe Carleo, Amnon Shashua
188 citations · 11 Feb 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Communities: VLM, SSL, SSeg
93,936 citations · 11 Oct 2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson
3,490 citations · 19 Aug 2018

Quantum Entanglement in Deep Learning Architectures
Yoav Levine, Or Sharir, Nadav Cohen, Amnon Shashua
182 citations · 26 Mar 2018

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
Communities: BDL
367 citations · 10 Nov 2017

On the Long-Term Memory of Deep Recurrent Networks
Yoav Levine, Or Sharir, Alon Ziv, Amnon Shashua
24 citations · 25 Oct 2017

Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
Communities: 3DV
129,831 citations · 12 Jun 2017

Analysis and Design of Convolutional Networks via Hierarchical Tensor Decompositions
Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, Amnon Shashua
38 citations · 05 May 2017

Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design
Yoav Levine, David Yakira, Nadav Cohen, Amnon Shashua
126 citations · 05 Apr 2017

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
Sheng Wang, S. Sun, Zerui Li, Renyu Zhang, Jinbo Xu
818 citations · 02 Sep 2016

Using the Output Embedding to Improve Language Models
Ofir Press, Lior Wolf
731 citations · 20 Aug 2016

Inductive Bias of Deep Convolutional Networks through Pooling Geometry
Nadav Cohen, Amnon Shashua
132 citations · 22 May 2016

Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, Alexandra Birch
7,683 citations · 31 Aug 2015