Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2206.01861
Cited By
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
4 June 2022
Z. Yao
Reza Yazdani Aminabadi
Minjia Zhang
Xiaoxia Wu
Conglong Li
Yuxiong He
VLM
MQ
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers"
50 / 324 papers shown
Title
Optimizing Large Language Model Training Using FP4 Quantization
Ruizhe Wang
Yeyun Gong
Xiao Liu
Guoshuai Zhao
Ziyue Yang
Baining Guo
Zhengjun Zha
Peng Cheng
MQ
67
5
0
28 Jan 2025
HadamRNN: Binary and Sparse Ternary Orthogonal RNNs
Armand Foucault
Franck Mamalet
François Malgouyres
MQ
74
0
0
28 Jan 2025
OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
Xing Hu
Yuan Cheng
Dawei Yang
Zukang Xu
Zhihang Yuan
Jiangyong Yu
Chen Xu
Zhe Jiang
Sifan Zhou
MQ
39
5
0
23 Jan 2025
Irrational Complex Rotations Empower Low-bit Optimizers
Zhen Tian
Wayne Xin Zhao
Zhicheng Dou
MQ
46
0
0
22 Jan 2025
Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach
Alireza Ghaffari
Sharareh Younesian
Boxing Chen
Vahid Partovi Nia
M. Asgharian
MQ
61
0
0
17 Jan 2025
FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
Yuji Chai
Mujin Kwen
David Brooks
Gu-Yeon Wei
MQ
44
3
0
13 Jan 2025
Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning
Zhen Li
Yupeng Su
Runming Yang
C. Xie
Zehua Wang
Zhongwei Xie
Ngai Wong
Hongxia Yang
MQ
LRM
51
3
0
06 Jan 2025
Pushing the Envelope of Low-Bit LLM via Dynamic Error Compensation
Y. Park
Jake Hyun
Hojoon Kim
Jae W. Lee
MQ
46
0
0
31 Dec 2024
LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
Binrui Zeng
Bin Ji
Xiaodong Liu
Jie Yu
Shasha Li
Jun Ma
Xiaopeng Li
Shangwen Wang
Xinran Hong
Yongtao Tang
MQ
42
1
0
24 Dec 2024
Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference
Jorge García-Carrasco
A. Maté
Juan Trujillo
73
0
0
20 Dec 2024
SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization
Runsheng Bai
Qiang Liu
B. Liu
MQ
72
1
0
05 Dec 2024
CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models
Amitash Nanda
Sree Bhargavi Balija
D. Sahoo
MQ
64
0
0
03 Dec 2024
SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors
M. Rakka
Jiajian Li
Guohao Dai
A. Eltawil
M. Fouda
Fadi J. Kurdahi
70
1
0
26 Nov 2024
Pushing the Limits of Large Language Model Quantization via the Linearity Theorem
Vladimir Malinovskii
Andrei Panferov
Ivan Ilin
Han Guo
Peter Richtárik
Dan Alistarh
MQ
78
7
0
26 Nov 2024
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution
Libo Zhu
Jiajian Li
Haotong Qin
W. J. Li
Yulun Zhang
Yong Guo
Xiaokang Yang
DiffM
MQ
72
2
0
26 Nov 2024
MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Yu Zhang
Hao Wu
Lancheng Zou
Wulong Liu
Hui-Ling Zhen
M. Yuan
Bei Yu
MQ
76
1
0
25 Nov 2024
Ex Uno Pluria: Insights on Ensembling in Low Precision Number Systems
G. Nam
Juho Lee
79
0
0
22 Nov 2024
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Zehua Pei
Hui-Ling Zhen
Xianzhi Yu
Sinno Jialin Pan
M. Yuan
Bei Yu
AI4CE
89
0
0
21 Nov 2024
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
Mohammadali Shakerdargah
Shan Lu
Chao Gao
Di Niu
72
0
0
20 Nov 2024
Bi-Mamba: Towards Accurate 1-Bit State Space Models
Shengkun Tang
Liqun Ma
Yiming Li
Mingjie Sun
Zhiqiang Shen
Mamba
75
3
0
18 Nov 2024
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
Yuzong Chen
Ahmed F. AbouElhamayed
Xilai Dai
Yang Wang
Marta Andronic
G. Constantinides
Mohamed S. Abdelfattah
MQ
108
1
0
18 Nov 2024
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Xuanlin Jiang
Yang Zhou
Shiyi Cao
Ion Stoica
Minlan Yu
47
8
0
02 Nov 2024
MoE-I
2
^2
2
: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
Yuanlin Duan
Wenqi Jia
Miao Yin
Yu Cheng
Bo Yuan
MoE
71
4
0
01 Nov 2024
Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis
Dong Wen
Z. Liu
Tong Yang
Tao Li
Tianyun Li
Chenglong Li
Jie Li
Zhigang Sun
50
0
0
01 Nov 2024
Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management
Tuowei Wang
Ruwen Fan
Minxing Huang
Zixu Hao
Kun Li
Ting Cao
Youyou Lu
Yaoxue Zhang
Ju Ren
50
2
0
25 Oct 2024
Pruning Foundation Models for High Accuracy without Retraining
Pu Zhao
Fei Sun
Xuan Shen
Pinrui Yu
Zhenglun Kong
Yanzhi Wang
Xue Lin
33
10
0
21 Oct 2024
SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
Jinda Jia
Cong Xie
Hanlin Lu
Daoce Wang
Hao Feng
...
Baixi Sun
Yanghua Peng
Zhi-Li Zhang
Xin Liu
Dingwen Tao
MQ
30
4
0
20 Oct 2024
Understanding the Difficulty of Low-Precision Post-Training Quantization for LLMs
Zifei Xu
Sayeh Sharify
W. Yazar
T. Webb
Xin Wang
MQ
43
0
0
18 Oct 2024
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo
Druv Pai
Yu Bai
Jiantao Jiao
Michael I. Jordan
Song Mei
29
10
0
17 Oct 2024
Channel-Wise Mixed-Precision Quantization for Large Language Models
Zihan Chen
Bike Xie
Jundong Li
Cong Shen
MQ
32
2
0
16 Oct 2024
Scaling laws for post-training quantized large language models
Zifei Xu
Alexander Lan
W. Yazar
T. Webb
Sayeh Sharify
Xin Wang
MQ
28
0
0
15 Oct 2024
Sorted Weight Sectioning for Energy-Efficient Unstructured Sparse DNNs on Compute-in-Memory Crossbars
Matheus Farias
H. T. Kung
18
1
0
15 Oct 2024
SLaNC: Static LayerNorm Calibration
Mahsa Salmani
Nikita Trukhanov
I. Soloveychik
MQ
31
0
0
14 Oct 2024
QEFT: Quantization for Efficient Fine-Tuning of LLMs
Changhun Lee
Jun-gyu Jin
Younghyun Cho
Eunhyeok Park
MQ
42
1
0
11 Oct 2024
CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression
Wenyuan Liu
Xindian Ma
Peng Zhang
Yan Wang
MQ
29
1
0
10 Oct 2024
Q-VLM: Post-training Quantization for Large Vision-Language Models
Changyuan Wang
Ziwei Wang
Xiuwei Xu
Yansong Tang
Jie Zhou
Jiwen Lu
MQ
32
1
0
10 Oct 2024
On Efficient Variants of Segment Anything Model: A Survey
Xiaorui Sun
Xiaozhong Liu
H. Shen
Xiaofeng Zhu
Ping Hu
VLM
51
4
0
07 Oct 2024
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Aurick Qiao
Z. Yao
Samyam Rajbhandari
Yuxiong He
VLM
37
0
0
04 Oct 2024
ARB-LLM: Alternating Refined Binarizations for Large Language Models
Zhiteng Li
Xinyu Yan
Tianao Zhang
Haotong Qin
Dong Xie
Jiang Tian
Zhongchao Shi
Linghe Kong
Yulun Zhang
Xiaokang Yang
MQ
34
2
0
04 Oct 2024
The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems
Linke Song
Zixuan Pang
Wenhao Wang
Zihao Wang
XiaoFeng Wang
Hongbo Chen
Wei Song
Yier Jin
Dan Meng
Rui Hou
56
7
0
30 Sep 2024
CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information
Yuxin Wang
Minghua Ma
Zekun Wang
Jingchang Chen
Huiming Fan
Liping Shan
Qing Yang
Dongliang Xu
Ming Liu
Bing Qin
38
3
0
20 Sep 2024
Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview
Yanshu Wang
Tong Yang
Xiyan Liang
Guoan Wang
Hanning Lu
Xu Zhe
Yaoming Li
Li Weitao
MQ
42
3
0
18 Sep 2024
OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models
Jahyun Koo
Dahoon Park
Sangwoo Jung
Jaeha Kung
MQ
21
0
0
06 Sep 2024
Foundations of Large Language Model Compression -- Part 1: Weight Quantization
Sean I. Young
MQ
45
1
0
03 Sep 2024
The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information
Diyuan Wu
Ionut-Vlad Modoranu
M. Safaryan
Denis Kuznedelev
Dan Alistarh
29
1
0
30 Aug 2024
Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
Guanchen Li
Xiandong Zhao
Lian Liu
Zeping Li
Dong Li
Lu Tian
Jie He
Ashish Sirasao
E. Barsoum
VLM
32
0
0
20 Aug 2024
ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
Chao Zeng
Songwei Liu
Yusheng Xie
Hong Liu
Xiaojian Wang
Miao Wei
Shu Yang
Fangmin Chen
Xing Mei
MQ
42
6
0
16 Aug 2024
LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
Zhiwen Mo
Lei Wang
Jianyu Wei
Zhichen Zeng
Shijie Cao
...
Naifeng Jing
Ting Cao
Jilong Xue
Fan Yang
Mao Yang
54
4
0
12 Aug 2024
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Jaehong Cho
Minsu Kim
Hyunmin Choi
Guseul Heo
Jongse Park
38
9
0
10 Aug 2024
Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations
Leo Donisch
Sigurd Schacht
Carsten Lanquillon
30
2
0
06 Aug 2024
Previous
1
2
3
4
5
6
7
Next