arXiv:1905.09418
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov
23 May 2019
Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" (50 of 228 papers shown):
Mixed SIGNals: Sign Language Production via a Mixture of Motion Primitives. Ben Saunders, Necati Cihan Camgöz, Richard Bowden. [SLR] 23 Jul 2021.
Learned Token Pruning for Transformers. Sehoon Kim, Sheng Shen, D. Thorsley, A. Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer. 02 Jul 2021.
A Primer on Pretrained Multilingual Language Models. Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar. [LRM] 01 Jul 2021.
AutoFormer: Searching Transformers for Visual Recognition. Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling. [ViT] 01 Jul 2021.
The MultiBERTs: BERT Reproductions for Robustness Analysis. Thibault Sellam, Steve Yadlowsky, Jason W. Wei, Naomi Saphra, Alexander D'Amour, ..., Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick. 30 Jun 2021.
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. Ahjeong Seo, Gi-Cheon Kang, J. Park, Byoung-Tak Zhang. 19 Jun 2021.
What Context Features Can Transformer Language Models Use? J. O'Connor, Jacob Andreas. [KELM] 15 Jun 2021.
Pre-Trained Models: Past, Present and Future. Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, ..., Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, Jun Zhu. [AIFin, MQ, AI4MH] 14 Jun 2021.
Patch Slimming for Efficient Vision Transformers. Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, Dacheng Tao. [ViT] 05 Jun 2021.
On Compositional Generalization of Neural Machine Translation. Yafu Li, Yongjing Yin, Yulong Chen, Yue Zhang. 31 May 2021.
LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond. Daniel Loureiro, A. Jorge, Jose Camacho-Collados. 26 May 2021.
Rationalization through Concepts. Diego Antognini, Boi Faltings. [FAtt] 11 May 2021.
FNet: Mixing Tokens with Fourier Transforms. James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. 09 May 2021.
Long-Span Summarization via Local Attention and Content Selection. Potsawee Manakul, Mark Gales. 08 May 2021.
Let's Play Mono-Poly: BERT Can Reveal Words' Polysemy Level and Partitionability into Senses. Aina Garí Soler, Marianna Apidianaki. [MILM] 29 Apr 2021.
Code Structure Guided Transformer for Source Code Summarization. Shuzheng Gao, Cuiyun Gao, Yulan He, Jichuan Zeng, L. Nie, Xin Xia, Michael R. Lyu. 19 Apr 2021.
Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation. Mozhdeh Gheini, Xiang Ren, Jonathan May. [LRM] 18 Apr 2021.
Knowledge Neurons in Pretrained Transformers. Damai Dai, Li Dong, Y. Hao, Zhifang Sui, Baobao Chang, Furu Wei. [KELM, MU] 18 Apr 2021.
Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm. Dongkuan Xu, Ian En-Hsu Yen, Jinxi Zhao, Zhibin Xiao. [VLM, AAML] 18 Apr 2021.
Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders. Fangyu Liu, Ivan Vulić, Anna Korhonen, Nigel Collier. [VLM, OffRL] 16 Apr 2021.
Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey. Danielle Saunders. [AI4CE] 14 Apr 2021.
DirectProbe: Studying Representations without Classifiers. Yichu Zhou, Vivek Srikumar. 13 Apr 2021.
UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost. Zhen Wu, Lijun Wu, Qi Meng, Yingce Xia, Shufang Xie, Tao Qin, Xinyu Dai, Tie-Yan Liu. 11 Apr 2021.
VisQA: X-raying Vision and Language Reasoning in Transformers. Theo Jaunet, Corentin Kervadec, Romain Vuillemot, G. Antipov, M. Baccouche, Christian Wolf. 02 Apr 2021.
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. Hila Chefer, Shir Gur, Lior Wolf. [ViT] 29 Mar 2021.
Dodrio: Exploring Transformer Models with Interactive Visualization. Zijie J. Wang, Robert Turko, Duen Horng Chau. 26 Mar 2021.
Pruning-then-Expanding Model for Domain Adaptation of Neural Machine Translation. Shuhao Gu, Yang Feng, Wanying Xie. [CLL, AI4CE] 25 Mar 2021.
Structured Co-reference Graph Attention for Video-grounded Dialogue. Junyeong Kim, Sunjae Yoon, Dahyun Kim, Chang D. Yoo. 24 Mar 2021.
The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures. Sushant Singh, A. Mahmood. [AI4TS] 23 Mar 2021.
Learning Calibrated-Guidance for Object Detection in Aerial Images. Zongqi Wei, Dong Liang, Dong-Ming Zhang, Liyan Zhang, Qixiang Geng, Mingqiang Wei, Huiyu Zhou. 21 Mar 2021.
Transformers in Vision: A Survey. Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, M. Shah. [ViT] 04 Jan 2021.
Transformer Feed-Forward Layers Are Key-Value Memories. Mor Geva, R. Schuster, Jonathan Berant, Omer Levy. [KELM] 29 Dec 2020.
CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade. Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun. 29 Dec 2020.
Multi-Head Self-Attention with Role-Guided Masks. Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian B. Hansen, Maria Maistro, J. Simonsen, Christina Lioma. 22 Dec 2020.
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. Hanrui Wang, Zhekai Zhang, Song Han. 17 Dec 2020.
Transformer Interpretability Beyond Attention Visualization. Hila Chefer, Shir Gur, Lior Wolf. 17 Dec 2020.
Positional Artefacts Propagate Through Masked Language Model Embeddings. Ziyang Luo, Artur Kulmizev, Xiaoxi Mao. 09 Nov 2020.
Rethinking the Value of Transformer Components. Wenxuan Wang, Zhaopeng Tu. 07 Nov 2020.
Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads. Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, Maosong Sun. [VLM] 07 Nov 2020.
Rethinking embedding coupling in pre-trained language models. Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, Sebastian Ruder. 24 Oct 2020.
Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy J. Lin, Rodrigo Nogueira, Andrew Yates. [VLM] 13 Oct 2020.
The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? Jasmijn Bastings, Katja Filippova. [XAI, LRM] 12 Oct 2020.
SMYRF: Efficient Attention using Asymmetric Clustering. Giannis Daras, Nikita Kitaev, Augustus Odena, A. Dimakis. 11 Oct 2020.
On the Sub-Layer Functionalities of Transformer Decoder. Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu. 06 Oct 2020.
TernaryBERT: Distillation-aware Ultra-low Bit BERT. Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu. [MQ] 27 Sep 2020.
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference. Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan, Ruiyi Zhang, Yifan Hu, Changyou Chen. [AAML] 20 Sep 2020.
Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler. [VLM] 14 Sep 2020.
Time-based Sequence Model for Personalization and Recommendation Systems. T. Ishkhanov, Maxim Naumov, Xianjie Chen, Yan Zhu, Yuan Zhong, A. Azzolini, Chonglin Sun, Frank Jiang, Andrey Malevich, Liang Xiong. 27 Aug 2020.
Data Movement Is All You Need: A Case Study on Optimizing Transformers. A. Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler. 30 Jun 2020.
Multi-Head Attention: Collaborate Instead of Concatenate. Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi. 29 Jun 2020.