Merging Feed-Forward Sublayers for Compressed Transformers (arXiv:2501.06126, v2 latest)
Neha Verma, Kenton W. Murray, Kevin Duh · AI4CE · 10 January 2025
Papers citing "Merging Feed-Forward Sublayers for Compressed Transformers"

37 papers shown.

1. Merging Text Transformer Models from Different Initializations
   Neha Verma, Maha Elbayad · MoMe · 01 Mar 2024

2. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
   Zechun Liu, Changsheng Zhao, Forrest N. Iandola, Chen Lai, Yuandong Tian, ..., Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra · ALM · 22 Feb 2024

3. LaCo: Large Language Model Pruning via Layer Collapse
   Yifei Yang, Zouying Cao, Hai Zhao · 17 Feb 2024

4. OLMo: Accelerating the Science of Language Models
   Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, ..., Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi · OSLM · 01 Feb 2024

5. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
   Luca Soldaini, Rodney Michael Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, ..., Hanna Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo · 31 Jan 2024

6. SliceGPT: Compress Large Language Models by Deleting Rows and Columns
   Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman · VLM · 26 Jan 2024

7. Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
   Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, Tianlong Chen · MoMe · 02 Oct 2023

8. QLoRA: Efficient Finetuning of Quantized LLMs
   Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer · ALM · 23 May 2023

9. Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
   Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui · 01 Feb 2023

10. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
    Zong-xiao Li, Chong You, Srinadh Bhojanapalli, Daliang Li, A. S. Rawat, ..., Kenneth Q Ye, Felix Chern, Felix X. Yu, Ruiqi Guo, Surinder Kumar · MoE · 12 Oct 2022

11. Git Re-Basin: Merging Models modulo Permutation Symmetries
    Samuel K. Ainsworth, J. Hayase, S. Srinivasa · MoMe · 11 Sep 2022

12. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
    Tim Dettmers, M. Lewis, Younes Belkada, Luke Zettlemoyer · MQ · 15 Aug 2022

13. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
    Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, A. A. Awan, Cheng-rong Li, ..., Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He · 30 Jun 2022

14. The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks
    R. Entezari, Hanie Sedghi, O. Saukh, Behnam Neyshabur · MoMe · 12 Oct 2021

15. Block Pruning For Faster Transformers
    François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush · VLM · 10 Sep 2021

16. LoRA: Low-Rank Adaptation of Large Language Models
    J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen · OffRL, AI4TS, AI4CE, ALM, AIMat · 17 Jun 2021

17. Lessons on Parameter Sharing across Layers in Transformers
    Sho Takase, Shun Kiyono · 13 Apr 2021

18. Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers
    Machel Reid, Edison Marrese-Taylor, Y. Matsuo · MoE · 01 Jan 2021

19. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby · ViT · 22 Oct 2020

20. The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT
    Jörg Tiedemann · 13 Oct 2020

21. Optimizing Mode Connectivity via Neuron Alignment
    N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, P. Sattigeri, Rongjie Lai · MoMe · 05 Sep 2020

22. On the Effect of Dropping Layers of Pre-trained Transformer Models
    Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov · 08 Apr 2020

23. Training Large Neural Networks with Constant Memory using a New Execution Algorithm
    B. Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, S. Bharadwaj · 13 Feb 2020

24. GLU Variants Improve Transformer
    Noam M. Shazeer · 12 Feb 2020

25. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
    Bogdan Gliwa, Iwona Mochol, M. Biesek, A. Wawer · 27 Nov 2019

26. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut · SSL, AIMat · 26 Sep 2019

27. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
    Elena Voita, David Talbot, F. Moiseev, Rico Sennrich, Ivan Titov · 23 May 2019

28. Similarity of Neural Network Representations Revisited
    Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey E. Hinton · 01 May 2019

29. Universal Transformers
    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser · 10 Jul 2018

30. A Call for Clarity in Reporting BLEU Scores
    Matt Post · 23 Apr 2018

31. Attention Is All You Need
    Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin · 3DV · 12 Jun 2017

32. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
    Hakan Inan, Khashayar Khosravi, R. Socher · 04 Nov 2016

33. Pointer Sentinel Mixture Models
    Stephen Merity, Caiming Xiong, James Bradbury, R. Socher · RALM · 26 Sep 2016

34. Using the Output Embedding to Improve Language Models
    Ofir Press, Lior Wolf · 20 Aug 2016

35. Convergent Learning: Do different neural networks learn the same representations?
    Yixuan Li, J. Yosinski, Jeff Clune, Hod Lipson, John E. Hopcroft · SSL · 24 Nov 2015

36. Distilling the Knowledge in a Neural Network
    Geoffrey E. Hinton, Oriol Vinyals, J. Dean · FedML · 09 Mar 2015

37. ImageNet Large Scale Visual Recognition Challenge
    Olga Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, ..., A. Karpathy, A. Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei · VLM, ObjD · 01 Sep 2014