Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

23 May 2019 · Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov · arXiv 1905.09418 (PDF, HTML)
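In brief, the paper scores each attention head's contribution and prunes the rest by learning a scalar gate per head under a relaxed L0 sparsity penalty; a small set of specialized heads (positional, syntactic, rare-token) turns out to carry most of the performance. The snippet below is a minimal sketch of head gating only, with fixed 0/1 gates standing in for the paper's learned stochastic gates; the function name and shapes are illustrative, not the authors' code.

    # Minimal sketch (not the authors' implementation): multi-head
    # self-attention with a scalar gate per head. Voita et al. make such
    # gates trainable under an L0-style penalty; here they are fixed 0/1
    # masks, purely to show that a zero gate removes a head's contribution.
    import torch
    import torch.nn.functional as F

    def gated_multi_head_attention(x, w_q, w_k, w_v, gates, n_heads):
        """x: (batch, seq, d_model); gates: (n_heads,) floats in [0, 1]."""
        batch, seq, d_model = x.shape
        d_head = d_model // n_heads  # assumes d_model divisible by n_heads

        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq, n_heads, d_head).transpose(1, 2)

        q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
        scores = q @ k.transpose(-2, -1) / d_head**0.5
        heads = F.softmax(scores, dim=-1) @ v        # (batch, heads, seq, d_head)
        heads = heads * gates.view(1, n_heads, 1, 1) # gate == 0 prunes the head
        return heads.transpose(1, 2).reshape(batch, seq, d_model)

    if __name__ == "__main__":
        d_model, n_heads = 64, 8
        x = torch.randn(2, 10, d_model)
        w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
        gates = torch.tensor([1., 1., 0., 0., 0., 0., 0., 0.])  # keep 2 of 8 heads
        out = gated_multi_head_attention(x, w_q, w_k, w_v, gates, n_heads)
        print(out.shape)  # torch.Size([2, 10, 64])

Because each head's output is multiplied by its gate before the heads are concatenated, setting a gate to zero removes that head from the computation exactly; the paper's contribution is training these gates so that most of them reach zero with little loss in translation quality.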

Papers citing "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned"

Showing 50 of 228 citing papers.
Survey: Exploiting Data Redundancy for Optimization of Deep Learning
Jou-An Chen, Wei Niu, Bin Ren, Yanzhi Wang, Xipeng Shen · 29 Aug 2022

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell · 27 Jul 2022 · AAML, AI4CE

Probing via Prompting
Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan · 04 Jul 2022

The Topological BERT: Transforming Attention into Topology for Natural Language Processing
Ilan Perez, Raphael Reinauer · 30 Jun 2022

Unveiling Transformers with LEGO: a synthetic reasoning task
Yi Zhang, A. Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner · 09 Jun 2022 · LRM

Optimizing Relevance Maps of Vision Transformers Improves Robustness
Hila Chefer, Idan Schwartz, Lior Wolf · 02 Jun 2022 · ViT

Lack of Fluency is Hurting Your Translation Model
J. Yoo, Jaewoo Kang · 24 May 2022

Life after BERT: What do Other Muppets Understand about Language?
Vladislav Lialin, Kevin Zhao, Namrata Shivagunde, Anna Rumshisky · 21 May 2022

Adaptable Adapters
N. Moosavi, Quentin Delfosse, Kristian Kersting, Iryna Gurevych · 03 May 2022

Attention Mechanism in Neural Networks: Where it Comes and Where it Goes
Derya Soydaner · 27 Apr 2022 · 3DV

Merging of neural networks
Martin Pašen, Vladimír Boža · 21 Apr 2022 · FedML, MoMe

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding
Shanshan Wang, Zhumin Chen, Zhaochun Ren, Huasheng Liang, Qiang Yan, Pengjie Ren · 06 Apr 2022

CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation
Nishant Kambhatla, Logan Born, Anoop Sarkar · 01 Apr 2022

Structured Pruning Learns Compact and Accurate Models
Mengzhou Xia, Zexuan Zhong, Danqi Chen · 01 Apr 2022 · VLM

TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models
Ziqing Yang, Yiming Cui, Zhigang Chen · 30 Mar 2022 · SyDa, VLM

Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection
Xin Huang, A. Khetan, Rene Bidart, Zohar Karnin · 27 Mar 2022

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, ..., David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder · 24 Mar 2022

Training-free Transformer Architecture Search
Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian, Jie Chen, Rongrong Ji · 23 Mar 2022 · ViT

Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention
Zuzana Jelčicová, Marian Verhelst · 20 Mar 2022

Gaussian Multi-head Attention for Simultaneous Machine Translation
Shaolei Zhang, Yang Feng · 17 Mar 2022

A Novel Perspective to Look At Attention: Bi-level Attention-based Explainable Topic Modeling for News Classification
Dairui Liu, Derek Greene, Ruihai Dong · 14 Mar 2022

Data-Efficient Structured Pruning via Submodular Optimization
Marwa El Halabi, Suraj Srinivas, Simon Lacoste-Julien · 09 Mar 2022

Understanding microbiome dynamics via interpretable graph representation learning
K. Melnyk, Kuba Weimann, Tim Conrad · 02 Mar 2022

XAI for Transformers: Better Explanations through Conservative Propagation
Ameen Ali, Thomas Schnake, Oliver Eberle, G. Montavon, Klaus-Robert Müller, Lior Wolf · 15 Feb 2022 · FAtt

ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning
J. Tan, Y. Tan, C. Chan, Joon Huang Chuah · 11 Feb 2022 · VLM, ViT

No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models
Chen Liang, Haoming Jiang, Simiao Zuo, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, T. Zhao · 06 Feb 2022

AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
Dongkuan Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Hassan Awadallah, Jianfeng Gao · 29 Jan 2022

Can Model Compression Improve NLP Fairness
Guangxuan Xu, Qingyuan Hu · 21 Jan 2022

Sparse Interventions in Language Models with Differentiable Masking
Nicola De Cao, Leon Schmid, Dieuwke Hupkes, Ivan Titov · 13 Dec 2021

Explainable Deep Learning in Healthcare: A Methodological Survey from an Attribution View
Di Jin, Elena Sergeeva, W. Weng, Geeticka Chauhan, Peter Szolovits · 05 Dec 2021 · OOD

Interpreting Deep Learning Models in Natural Language Processing: A Review
Xiaofei Sun, Diyi Yang, Xiaoya Li, Tianwei Zhang, Yuxian Meng, Han Qiu, Guoyin Wang, Eduard H. Hovy, Jiwei Li · 20 Oct 2021

Compositional Attention: Disentangling Search and Retrieval
Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, Guillaume Lajoie · 18 Oct 2021

On the Pitfalls of Analyzing Individual Neurons in Language Models
Omer Antverg, Yonatan Belinkov · 14 Oct 2021 · MILM

Global Vision Transformer Pruning with Hessian-Aware Saliency
Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Helen Li, Jan Kautz · 10 Oct 2021 · ViT

MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou · 05 Oct 2021 · MoE

On the Prunability of Attention Heads in Multilingual BERT
Aakriti Budhraja, Madhura Pande, Pratyush Kumar, Mitesh M. Khapra · 26 Sep 2021

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui · 15 Sep 2021

The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders
Han He, Jinho Choi · 14 Sep 2021

The Grammar-Learning Trajectories of Neural Language Models
Leshem Choshen, Guy Hacohen, D. Weinshall, Omri Abend · 13 Sep 2021

Modeling Concentrated Cross-Attention for Neural Machine Translation with Gaussian Mixture Model
Shaolei Zhang, Yang Feng · 11 Sep 2021

Document-level Entity-based Extraction as Template Generation
Kung-Hsiang Huang, Sam Tang, Nanyun Peng · 10 Sep 2021

Block Pruning For Faster Transformers
François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush · 10 Sep 2021 · VLM

Bag of Tricks for Optimizing Transformer Efficiency
Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu · 09 Sep 2021

Transformers in the loop: Polarity in neural models of language
Lisa Bylinina, Alexey Tikhonov · 08 Sep 2021

Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems
Potsawee Manakul, Mark Gales · 08 Sep 2021

Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience
G. Chrysostomou, Nikolaos Aletras · 31 Aug 2021

T3-Vis: a visual analytic framework for Training and fine-Tuning Transformers in NLP
Raymond Li, Wen Xiao, Lanjun Wang, Hyeju Jang, Giuseppe Carenini · 31 Aug 2021 · ViT

Layer-wise Model Pruning based on Mutual Information
Chun Fan, Jiwei Li, Xiang Ao, Fei Wu, Yuxian Meng, Xiaofei Sun · 28 Aug 2021

Differentiable Subset Pruning of Transformer Heads
Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan · 10 Aug 2021

A Dynamic Head Importance Computation Mechanism for Neural Machine Translation
Akshay Goindani, Manish Shrivastava · 03 Aug 2021