arXiv: 2110.01786
MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
5 October 2021
Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou
Community: MoE
Papers citing "MoEfication: Transformer Feed-forward Layers are Mixtures of Experts" (6 of 56 shown)
Gaussian Error Linear Units (GELUs) — Dan Hendrycks, Kevin Gimpel (27 Jun 2016)

SQuAD: 100,000+ Questions for Machine Comprehension of Text — Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang [RALM] (16 Jun 2016)

Distilling the Knowledge in a Neural Network — Geoffrey E. Hinton, Oriol Vinyals, J. Dean [FedML] (09 Mar 2015)

Deep Learning of Representations: Looking Forward — Yoshua Bengio (02 May 2013)

Maxout Networks — Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio [OOD] (18 Feb 2013)

Improving neural networks by preventing co-adaptation of feature detectors — Geoffrey E. Hinton, Nitish Srivastava, A. Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov [VLM] (03 Jul 2012)