Model Compression and Efficient Inference for Large Language Models: A Survey
arXiv:2402.09748 (15 February 2024) [MQ]
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He

Papers citing "Model Compression and Efficient Inference for Large Language Models: A Survey" (50 of 76 papers shown)
Taming the Titans: A Survey of Efficient LLM Inference Serving (28 Apr 2025) [LLMAG]
Ranran Zhen, J. Li, Yixin Ji, Z. Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Z. Wang, Baoxing Huai, M. Zhang

CUT: Pruning Pre-Trained Multi-Task Models into Compact Models for Edge Devices (14 Apr 2025)
Jingxuan Zhou, Weidong Bao, Ji Wang, Zhengyi Zhong

Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization (13 Apr 2025) [MQ]
Yamato Arai, Yuma Ichikawa

Token Level Routing Inference System for Edge Devices (10 Apr 2025)
Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric P. Xing, Huaxiu Yao, Qirong Ho

Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability (05 Apr 2025) [AI4CE]
Vishnu Kabir Chhabra, Mohammad Mahdi Khalili

Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency (04 Apr 2025) [MQ]
E. J. Husom, Arda Goknil, Merve Astekin, Lwin Khin Shar, Andre Kåsen, S. Sen, Benedikt Andreas Mithassel, Ahmet Soylu
Efficient Inference for Large Reasoning Models: A Survey (29 Mar 2025) [LLMAG, LRM]
Y. Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi

Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent (21 Mar 2025) [AI4CE]
Humza Nusrat, Bing Luo, Ryan Hall, Joshua Kim, H. Bagher-Ebadian, Anthony Doemer, B. Movsas, Kundan Thind

DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems (03 Mar 2025)
Minoo Hosseinzadeh, Hana Khamfroush

LLM Inference Acceleration via Efficient Operation Fusion (24 Feb 2025)
Mahsa Salmani, I. Soloveychik
Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing (24 Feb 2025)
Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models (24 Feb 2025) [MQ]
Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis (18 Feb 2025) [MQ]
J. Zhao, M. Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie

Deploying Foundation Model Powered Agent Services: A Survey (18 Dec 2024) [AI4CE]
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen
SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors (26 Nov 2024)
M. Rakka, J. Li, Guohao Dai, A. Eltawil, M. Fouda, Fadi J. Kurdahi

Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models (25 Nov 2024) [SyDa]
Y. Fu, Yin Yu, Xiaotian Han, Runchao Li, Xianxuan Long, Haotian Yu, Pan Li

Beyond Task Vectors: Selective Task Arithmetic Based on Importance Metrics (25 Nov 2024) [MoMe]
Tian Bowen, Lai Songning, Wu Jiemin, Shuai Zhihao, Ge Shiming, Yue Yutao

An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2 (15 Nov 2024) [MQ]
Pepijn de Reus, Ana Oprescu, Jelle Zuidema

Software Performance Engineering for Foundation Model-Powered Software (FMware) (14 Nov 2024)
Haoxiang Zhang, Shi Chang, Arthur Leung, Kishanthan Thangarajah, Boyuan Chen, Hanan Lutfiyya, Ahmed E. Hassan
Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy (05 Nov 2024)
Razvan-Gabriel Dumitru, Paul-Ioan Clotan, Vikas Yadav, Darius Peteleaza, Mihai Surdeanu

The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems (30 Sep 2024)
Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou

Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents (25 Sep 2024) [LLMAG]
Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models (19 Sep 2024) [LRM]
Shengsheng Qian, Zuyi Zhou, Dizhan Xue, Bing Wang, Changsheng Xu
Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models (17 Sep 2024)
Bishwash Khanal, Jeffery M. Capone

Computer Vision Model Compression Techniques for Embedded Systems: A Survey (15 Aug 2024)
Alexandre Lopes, Fernando Pereira dos Santos, D. Oliveira, Mauricio Schiezaro, Hélio Pedrini

CELLM: An Efficient Communication in Large Language Models Training for Federated Learning (30 Jul 2024)
Raja Vavekanand, Kira Sam

Compact Language Models via Pruning and Knowledge Distillation (19 Jul 2024) [SyDa, MQ]
Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, M. Patwary, M. Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

BlockPruner: Fine-grained Pruning for Large Language Models (15 Jun 2024)
Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li
BiSup: Bidirectional Quantization Error Suppression for Large Language Models (24 May 2024) [MQ]
Minghui Zou, Ronghui Guo, Sai Zhang, Xiaowang Zhang, Zhiyong Feng

Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey (01 May 2024) [MQ]
Dayou Du, Gu Gong, Xiaowen Chu

A Survey on Efficient Inference for Large Language Models (22 Apr 2024)
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu-Xiang Wang

A Survey on the Memory Mechanism of Large Language Model based Agents (21 Apr 2024) [LLMAG, KELM]
Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, Ji-Rong Wen
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models (03 Apr 2024) [ALM]
Taiqiang Wu, Chaofan Tao, Jiahao Wang, Zhe Zhao, Ngai Wong

LLM Inference Unveiled: Survey and Roofline Model Insights (26 Feb 2024)
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, ..., Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer

SliceGPT: Compress Large Language Models by Deleting Rows and Columns (26 Jan 2024) [VLM]
Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

Fast and Optimal Weight Update for Pruned Large Language Models (01 Jan 2024)
Vladimír Boza
PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation (27 Dec 2023)
Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, ..., Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao

PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs (23 Dec 2023) [VLM]
Max Zimmer, Megi Andoni, Christoph Spiegel, S. Pokutta

Mini-GPTs: Efficient Large Language Models through Contextual Pruning (20 Dec 2023)
Tim Valicenti, Justice Vidal, Ritik Patnaik

SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification (16 Dec 2023)
Yuntao Gui, Xiao Yan, Peiqi Yin, Han Yang, James Cheng

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (16 Dec 2023) [BDL]
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation (22 Oct 2023) [VLM]
Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, I. Laradji

Democratizing Reasoning Ability: Tailored Learning from Large Language Model (20 Oct 2023) [LRM]
Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, Jiahai Wang, Minghui Song, ..., Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models (14 Oct 2023)
Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Can pruning make Large Language Models more efficient? (06 Oct 2023)
Sia Gholami, Marwan Omar

Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models (18 Jun 2023) [VLM]
A. Jaiswal, Shiwei Liu, Tianlong Chen, Ying Ding, Zhangyang Wang
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes (03 May 2023) [ALM]
Lokesh Nagalapatti, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

SCOTT: Self-Consistent Chain-of-Thought Distillation (03 May 2023) [LRM]
Jamie Yap, Zhengyang Wang, Zheng Li, K. Lynch, Bing Yin, Xiang Ren

LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions (27 Apr 2023) [ALM]
Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji

Instruction Tuning with GPT-4 (06 Apr 2023) [SyDa, ALM, LM&MA]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao