SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [MQ]
arXiv: 2211.10438 · 18 November 2022
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

Cited By
Papers citing "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models" (50 of 533 papers shown)
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
Zirui Liu, Qingquan Song, Q. Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, Xia Hu · 08 Jan 2024
IoT in the Era of Generative AI: Vision and Challenges
Xin Wang, Zhongwei Wan, Arvin Hekmati, M. Zong, Samiul Alam, Mi Zhang, Bhaskar Krishnamachari · 03 Jan 2024
Fast and Optimal Weight Update for Pruned Large Language Models
Vladimír Boza · 01 Jan 2024
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [MLLM]
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, …, Fei Wei, Xinyu Zhang, Bo-Wen Zhang, Xiaolin Wei, Chunhua Shen · 28 Dec 2023
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang · 23 Dec 2023
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU [BDL]
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen · 16 Dec 2023
Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language Models
Alireza Ghaffari, Justin Yu, Mahsa Ghazvini Nejad, M. Asgharian, Boxing Chen, Vahid Partovi Nia · 14 Dec 2023
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, …, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao · 14 Dec 2023
CBQ: Cross-Block Quantization for Large Language Models [MQ]
Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun-feng Zhang, Wei Li, …, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang · 13 Dec 2023
Efficient Quantization Strategies for Latent Diffusion Models [DiffM, MQ]
Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, Hongbo Zhang · 09 Dec 2023
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism [LRM]
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou · 08 Dec 2023
SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM [MQ]
Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng · 06 Dec 2023
Nonparametric Variational Regularisation of Pretrained Transformers
Fabio Fehr, James Henderson · 01 Dec 2023
Deepfakes, Misinformation, and Disinformation in the Era of Frontier AI, Generative AI, and Large AI Models
Mohamed R. Shoaib, Ze Wang, Milad Taleby Ahvanooey, Jun Zhao · 29 Nov 2023
Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [MQ]
Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra · 21 Nov 2023
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [MQ]
Han Guo, P. Greengard, Eric P. Xing, Yoon Kim · 20 Nov 2023
Zero redundancy distributed learning with differential privacy
Zhiqi Bu, Justin Chiu, Ruixuan Liu, Sheng Zha, George Karypis · 20 Nov 2023
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment [SyDa]
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan · 20 Nov 2023
A Speed Odyssey for Deployable Quantization of LLMs [MQ]
Qingyuan Li, Ran Meng, Yiduo Li, Bo-Wen Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie · 16 Nov 2023
Towards the Law of Capacity Gap in Distilling Language Models [ELM]
Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao · 13 Nov 2023
AI-native Interconnect Framework for Integration of Large Language Model Technologies in 6G Systems
Sasu Tarkoma, Roberto Morabito, Jaakko Sauvola · 10 Nov 2023
S-LoRA: Serving Thousands of Concurrent LoRA Adapters [MoE]
Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, …, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica · 06 Nov 2023
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models [MQ]
Baisong Li, Xingwang Wang, Haixiao Xu · 02 Nov 2023
Efficient LLM Inference on CPUs [MQ]
Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng · 01 Nov 2023
SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [MoE]
Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Helen Li, Yiran Chen · 29 Oct 2023
Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · 28 Oct 2023
ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers [MQ]
Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He · 26 Oct 2023
FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing [FedML]
Terence Jie Chua, Wen-li Yu, Junfeng Zhao, Kwok-Yan Lam · 26 Oct 2023
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time [BDL]
Zichang Liu, Jue Wang, Tri Dao, Dinesh Manocha, Binhang Yuan, …, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, Beidi Chen · 26 Oct 2023
LLM Performance Predictors are good initializers for Architecture Search [LRM]
Ganesh Jawahar, Muhammad Abdul-Mageed, L. Lakshmanan, Dujian Ding · 25 Oct 2023
Vision Language Models in Autonomous Driving: A Survey and Outlook [VLM]
Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, B. L. Žagar, Walter Zimmer, Hu Cao, Alois C. Knoll · 22 Oct 2023
BitNet: Scaling 1-bit Transformers for Large Language Models [MQ]
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei · 17 Oct 2023
TEQ: Trainable Equivalent Transformation for Quantization of LLMs [MQ]
Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen · 17 Oct 2023
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian · 14 Oct 2023
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [MQ, SyDa]
Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh · 13 Oct 2023
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [SyDa]
Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji · 13 Oct 2023
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models [MQ]
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao · 12 Oct 2023
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models [MQ]
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang · 12 Oct 2023
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, …, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang · 11 Oct 2023
Sparse Fine-tuning for Inference Acceleration of Large Language Models
Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh · 10 Oct 2023
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen · 10 Oct 2023
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu · 09 Oct 2023
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, SeYoung Yun · 09 Oct 2023
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, …, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu · 08 Oct 2023
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference? [MQ]
Cheng Zhang, Jianyi Cheng, Ilia Shumailov, George A. Constantinides, Yiren Zhao · 08 Oct 2023
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Song Guo, Jiahang Xu, Li Zhang, Mao Yang · 08 Oct 2023
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [MQ]
Luoming Zhang, Wen Fei, Weijia Wu, Yefei He, Zhenyu Lou, Hong Zhou · 07 Oct 2023
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh, Keivan Alizadeh-Vahid, Sachin Mehta, C. C. D. Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar · 06 Oct 2023
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Josh Alman, Zhao Song · 06 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [VLM]
Yi-Lin Sung, Jaehong Yoon, Mohit Bansal · 04 Oct 2023