Improving Transformer Models by Reordering their Sublayers
arXiv:1911.03864 · 10 November 2019
Ofir Press, Noah A. Smith, Omer Levy
Papers citing "Improving Transformer Models by Reordering their Sublayers"
26 / 26 papers shown
Improving Routing in Sparse Mixture of Experts with Graph of Tokens
Tam Minh Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk
MoE · 66 · 0 · 0 · 01 May 2025

MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
R. Teo, Tan M. Nguyen
MoE · 33 · 3 · 0 · 18 Oct 2024

Evolutionary Neural Architecture Search for Transformer in Knowledge Tracing
Shangshang Yang, Xiaoshan Yu, Ye Tian, Xueming Yan, Haiping Ma, Xingyi Zhang
ViT · KELM · AI4Ed · 21 · 2 · 0 · 02 Oct 2023

Centered Self-Attention Layers
Ameen Ali, Tomer Galanti, Lior Wolf
34 · 6 · 0 · 02 Jun 2023

Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, ..., Vincent Zhao, Yuexin Wu, Bo-wen Li, Yu Zhang, Ming-Wei Chang
BDL · AI4CE · 30 · 54 · 0 · 11 Apr 2023

Interpretability in Activation Space Analysis of Transformers: A Focused Survey
Soniya Vijayakumar
AI4CE · 35 · 3 · 0 · 22 Jan 2023

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith
25 · 3 · 0 · 11 Jan 2023

A Neural ODE Interpretation of Transformer Layers
Yaofeng Desmond Zhong, Tongtao Zhang, Amit Chakraborty, Biswadip Dey
28 · 9 · 0 · 12 Dec 2022

Finding Skill Neurons in Pre-trained Transformer-based Language Models
Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, Juanzi Li
MILM · MoE · 27 · 50 · 0 · 14 Nov 2022

E-Branchformer: Branchformer with Enhanced merging for speech recognition
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu Jeong Han, Shinji Watanabe
61 · 105 · 0 · 30 Sep 2022

Efficient Sparsely Activated Transformers
Salar Latifi, Saurav Muralidharan, M. Garland
MoE · 21 · 2 · 0 · 31 Aug 2022

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
Yifan Peng, Siddharth Dalmia, Ian Lane, Shinji Watanabe
30 · 143 · 0 · 06 Jul 2022

Structured Pruning Learns Compact and Accurate Models
Mengzhou Xia, Zexuan Zhong, Danqi Chen
VLM · 9 · 177 · 0 · 01 Apr 2022

TrimBERT: Tailoring BERT for Trade-offs
S. N. Sridhar, Anthony Sarah, Sairam Sundaresan
MQ · 21 · 4 · 0 · 24 Feb 2022

NormFormer: Improved Transformer Pretraining with Extra Normalization
Sam Shleifer, Jason Weston, Myle Ott
AI4CE · 33 · 74 · 0 · 18 Oct 2021

GNN-LM: Language Modeling based on Global Contexts via GNN
Yuxian Meng, Shi Zong, Xiaoya Li, Xiaofei Sun, Tianwei Zhang, Fei Wu, Jiwei Li
LRM · 24 · 37 · 0 · 17 Oct 2021

Thinking Like Transformers
Gail Weiss, Yoav Goldberg, Eran Yahav
AI4CE · 35 · 127 · 0 · 13 Jun 2021

A Survey of Transformers
Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu
ViT · 53 · 1,088 · 0 · 08 Jun 2021

Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
Noam Wies, Yoav Levine, Daniel Jannai, Amnon Shashua
40 · 20 · 0 · 09 May 2021

Finetuning Pretrained Transformers into RNNs
Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, Noah A. Smith
36 · 63 · 0 · 24 Mar 2021

Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press, Noah A. Smith, M. Lewis
230 · 89 · 0 · 31 Dec 2020

Reservoir Transformers
Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, Douwe Kiela
35 · 17 · 0 · 30 Dec 2020

Transformer Feed-Forward Layers Are Key-Value Memories
Mor Geva, R. Schuster, Jonathan Berant, Omer Levy
KELM · 39 · 745 · 0 · 29 Dec 2020

The Depth-to-Width Interplay in Self-Attention
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, Amnon Shashua
30 · 45 · 0 · 22 Jun 2020

Normalized Attention Without Probability Cage
Oliver Richter, Roger Wattenhofer
14 · 21 · 0 · 19 May 2020

Neural Architecture Search with Reinforcement Learning
Barret Zoph, Quoc V. Le
271 · 5,327 · 0 · 05 Nov 2016