arXiv: 2009.04534
Pay Attention when Required
9 September 2020
Swetha Mandava, Szymon Migacz, A. Fit-Florea
Papers citing "Pay Attention when Required" (21 of 21 papers shown)
Multi-Head Attention: Collaborate Instead of Concatenate (29 Jun 2020)
Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
54 · 114 · 0

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [VLM] (25 Feb 2020)
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou
170 · 1,280 · 0

Improving Transformer Models by Reordering their Sublayers (10 Nov 2019)
Ofir Press, Noah A. Smith, Omer Levy
68 · 87 · 0

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (02 Oct 2019)
Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
255 · 7,547 · 0

TinyBERT: Distilling BERT for Natural Language Understanding [VLM] (23 Sep 2019)
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, F. Wang, Qun Liu
109 · 1,869 · 0

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [MoE] (17 Sep 2019)
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
334 · 1,917 · 0

Augmenting Self-attention with Persistent Memory [RALM, KELM] (02 Jul 2019)
Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, Armand Joulin
73 · 139 · 0

Are Sixteen Heads Really Better than One? [MoE] (25 May 2019)
Paul Michel, Omer Levy, Graham Neubig
105 · 1,068 · 0

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes [ODL] (01 Apr 2019)
Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
257 · 999 · 0

The Evolved Transformer [ViT] (30 Jan 2019)
David R. So, Chen Liang, Quoc V. Le
110 · 464 · 0

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [VLM] (09 Jan 2019)
Zihang Dai, Zhilin Yang, Yiming Yang, J. Carbonell, Quoc V. Le, Ruslan Salakhutdinov
253 · 3,745 · 0

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (02 Dec 2018)
Han Cai, Ligeng Zhu, Song Han
102 · 1,875 · 0

Character-Level Language Modeling with Deeper Self-Attention (09 Aug 2018)
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones
145 · 392 · 0

DARTS: Differentiable Architecture Search (24 Jun 2018)
Hanxiao Liu, Karen Simonyan, Yiming Yang
204 · 4,366 · 0

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding [ELM] (20 Apr 2018)
Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
1.1K · 7,196 · 0

Learning Transferable Architectures for Scalable Image Recognition (21 Jul 2017)
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le
186 · 5,607 · 0

Neural Architecture Search with Reinforcement Learning (05 Nov 2016)
Barret Zoph, Quoc V. Le
478 · 5,381 · 0

Categorical Reparameterization with Gumbel-Softmax [BDL] (03 Nov 2016)
Eric Jang, S. Gu, Ben Poole
349 · 5,379 · 0

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables [BDL] (02 Nov 2016)
Chris J. Maddison, A. Mnih, Yee Whye Teh
198 · 2,537 · 0

Pointer Sentinel Mixture Models [RALM] (26 Sep 2016)
Stephen Merity, Caiming Xiong, James Bradbury, R. Socher
338 · 2,898 · 0

SQuAD: 100,000+ Questions for Machine Comprehension of Text [RALM] (16 Jun 2016)
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang
314 · 8,169 · 0