Mixture of Attention Heads: Selecting Attention Heads Per Token

11 October 2022 · arXiv 2210.05144
Xiaofeng Zhang, Songlin Yang, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong
Tags: MoE
Links: ArXiv (abs) · PDF · HTML
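
For orientation, the sketch below illustrates the routing scheme the title refers to: each token's representation is scored against a pool of attention "experts", the top-k heads are evaluated for that token, and their outputs are mixed with the renormalised routing weights. The module name, dimensions, top-k value, and the shared key/value projections are illustrative assumptions for a minimal sketch, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAttentionHeads(nn.Module):
    # Minimal sketch of per-token attention-head routing: a router picks
    # top-k attention "experts" for each query token and mixes their outputs
    # by the routing weights. Hyperparameters and the shared key/value
    # projections are assumptions, not the paper's exact configuration.
    def __init__(self, d_model=512, d_head=64, num_experts=8, top_k=2):
        super().__init__()
        self.d_head, self.top_k = d_head, top_k
        self.router = nn.Linear(d_model, num_experts)        # per-token gate over heads
        self.w_q = nn.Parameter(torch.randn(num_experts, d_model, d_head) * d_model ** -0.5)
        self.w_o = nn.Parameter(torch.randn(num_experts, d_head, d_model) * d_head ** -0.5)
        self.k_proj = nn.Linear(d_model, d_head)             # shared keys (assumption)
        self.v_proj = nn.Linear(d_model, d_head)             # shared values (assumption)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)             # (B, T, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)         # k heads chosen per token
        weights = weights / weights.sum(-1, keepdim=True)    # renormalise over chosen heads
        k, v = self.k_proj(x), self.v_proj(x)                # (B, T, d_head), shared by all experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            e = idx[..., slot]                               # (B, T) expert id for this slot
            q = torch.einsum('btd,btdh->bth', x, self.w_q[e])
            attn = F.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
            head = attn @ v                                  # (B, T, d_head)
            out = out + weights[..., slot:slot + 1] * torch.einsum('bth,bthd->btd', head, self.w_o[e])
        return out

# Usage: y = MixtureOfAttentionHeads()(torch.randn(2, 16, 512))

Causal masking and any auxiliary load-balancing terms are omitted here for brevity.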

Papers citing "Mixture of Attention Heads: Selecting Attention Heads Per Token"

34 papers

MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, Jianzong Wang
Tags: MQ, MoE
09 Jun 2025

Action is All You Need: Dual-Flow Generative Ranking Network for Recommendation
Hao Guo, Erpeng Xue, Lei Huang, Shichao Wang, Xiaolei Wang, Lei Wang, Jinpeng Wang, Sheng Chen
22 May 2025

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Tags: MoE
21 May 2025

UMoE: Unifying Attention and FFN with Shared Experts
Yuanhang Yang, Chaozheng Wang, Jing Li
Tags: MoE
12 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
Tags: MoE, VLM
01 May 2025

MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification
Yichu Xu, Di Wang, Hongzan Jiao, Li Zhang, Lefei Zhang
Tags: Mamba
29 Apr 2025

RouterKT: Mixture-of-Experts for Knowledge Tracing
Han Liao, Shuaishuai Zu
11 Apr 2025

Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration
Shihao Zhou, Dayu Li, Jinshan Pan, Juncheng Zhou, Jinglei Shi, Jufeng Yang
26 Mar 2025

VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention
Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, Jun Liu
14 Mar 2025

Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Siyang Song, Eugene Belilovsky, Irina Rish
Tags: MoE
06 Mar 2025

MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs
Yuhang Zhou, Giannis Karamanolakis, Victor Soto, Anna Rumshisky, Mayank Kulkarni, Furong Huang, Wei Ai, Jianhua Lu
Tags: MoMe
03 Feb 2025

MoH: Multi-Head Attention as Mixture-of-Head Attention
Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Tags: MoE
15 Oct 2024

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
Shangbin Feng, Zifeng Wang, Yike Wang, Sayna Ebrahimi, Hamid Palangi, ..., Nathalie Rauschmayr, Yejin Choi, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
Tags: MoMe
15 Oct 2024

Exploring the Benefit of Activation Sparsity in Pre-training
Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou
Tags: MoE
04 Oct 2024

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, ..., Jakob N. Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Üstün, Acyr Locatelli
Tags: MoMe, MoE
15 Aug 2024

FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models
Zhongyu Zhao, Menghang Dong, Rongyu Zhang, Wenzhao Zheng, Yunpeng Zhang, Huanrui Yang, Dalong Du, Kurt Keutzer, Shanghang Zhang
15 Aug 2024

Layerwise Recurrent Router for Mixture-of-Experts
Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu
Tags: MoE
13 Aug 2024

A Survey on Mixture of Experts in Large Language Models
Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang
Tags: MoE
26 Jun 2024

A Closer Look into Mixture-of-Experts in Large Language Models
Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu
Tags: MoE
26 Jun 2024

MoEUT: Mixture-of-Experts Universal Transformers
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning
Tags: MoE
25 May 2024

Improving Transformers with Dynamically Composable Multi-Head Attention
Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
14 May 2024

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini
12 Apr 2024

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Bowen Pan, Songlin Yang, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Yikang Shen
Tags: MoE
08 Apr 2024

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, You He
Tags: VLM, KELM, CLL, OODD
18 Mar 2024

Scattered Mixture-of-Experts Implementation
Shawn Tan, Songlin Yang, Yikang Shen, Aaron Courville
Tags: MoE
13 Mar 2024

Conditional computation in neural networks: principles and research trends
Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, V. Marsocci, Pasquale Minervini, Jary Pomponi
12 Mar 2024

Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference
Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane
15 Dec 2023

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Róbert Csordás, Piotr Piekos, Kazuki Irie, Jürgen Schmidhuber
Tags: MoE
13 Dec 2023

Unlocking Emergent Modularity in Large Language Models
Zihan Qiu, Zeyu Huang, Jie Fu
17 Oct 2023

Sparse Universal Transformer
Shawn Tan, Songlin Yang, Zhenfang Chen, Aaron Courville, Chuang Gan
Tags: MoE
11 Oct 2023

Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Yongqian Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Tong He, Wanli Ouyang
Tags: MoMe
11 Aug 2023

ModuleFormer: Modularity Emerges from Mixture-of-Experts
Songlin Yang, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan
Tags: KELM, MoE
07 Jun 2023

EIT: Enhanced Interactive Transformer
Tong Zheng, Bei Li, Huiwen Bao, Tong Xiao, Jingbo Zhu
20 Dec 2022

Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners
Zitian Chen, Songlin Yang, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, E. Learned-Miller, Chuang Gan
Tags: MoE
15 Dec 2022