ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
arXiv:2211.10438 · 18 November 2022
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han
Tags: MQ

Papers citing "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models"

Showing 50 of 533 citing papers.
  • Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference · Tomer Gafni, Asaf Karnieli, Yair Hanani · MQ · 20 May 2025
  • Scaling Law for Quantization-Aware Training · Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, ..., Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo · MQ · 20 May 2025
  • Fine-tuning Quantized Neural Networks with Zeroth-order Optimization · Sifeng Shang, Jiayi Zhou, Chenyu Lin, Minxian Li, Kaiyang Zhou · MQ · 19 May 2025
  • Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization · Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab · MQ · 16 May 2025
  • MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems · Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, ..., Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai · MoE · 16 May 2025
  • InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models · Zhen Li, Yupeng Su, Songmiao Wang, Runming Yang, C. Xie, ..., Ming Li, Jiannong Cao, Yuan Xie, Ngai Wong, Hongxia Yang · MQ · 16 May 2025
  • The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks · Zhonghao Lyu, Ming Xiao, Jie Xu, Mikael Skoglund, Marco Di Renzo · 14 May 2025
  • QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads · Khurram Mazher, Saad Bin Nasir · MQ · 12 May 2025
  • Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption · Jordan Fréry, Roman Bredehoft, Jakub Klemsa, Arthur Meyre, Andrei Stoian · 12 May 2025
  • Semantic Retention and Extreme Compression in LLMs: Can We Have Both? · Stanislas Laborde, Martin Cousseau, Antoun Yaacoub, Lionel Prevost · MQ · 12 May 2025
  • Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations · Patrick Blumenberg, Thomas Graave, Tim Fingscheidt · MQ · 10 May 2025
  • QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration · HamidReza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek A. El-Ghazawi · MoE · 10 May 2025
  • Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference · Haolin Zhang, Jeff Huang · 09 May 2025
  • Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition · Zhiyuan Chen, Keyi Li, Yifan Jia, Le Ye, Yufei Ma · DiffM · 09 May 2025
  • Stability in Single-Peaked Strategic Resource Selection Games · Henri Zeiler · 09 May 2025
  • MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design · Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin · MQ, MoE · 09 May 2025
  • Diffusion Model Quantization: A Review · Qian Zeng, Chenggong Hu, Mingli Song, Jie Song · MQ · 08 May 2025
  • Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques · Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar · 05 May 2025
  • Radio: Rate-Distortion Optimization for Large Language Model Compression · Sean I. Young · MQ · 05 May 2025
  • Quantitative Analysis of Performance Drop in DeepSeek Model Quantization · Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Z. Chen, Rongjia Du, Siqi Xiao, Jingyang Zhang, Ning Wang, Shiguo Lian · MQ · 05 May 2025
  • An Empirical Study of Qwen3 Quantization · Xingyu Zheng, Yuye Li, Haoran Chu, Yue Feng, Xudong Ma, Jie Luo, Jinyang Guo, Haotong Qin, Michele Magno, Xianglong Liu · MQ · 04 May 2025
  • RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization · Chen Xu, Yuxuan Yue, Zukang Xu, Xing Hu, Jiangyong Yu, Zhixuan Chen, Sifan Zhou, Zhihang Yuan, Dawei Yang · MQ · 02 May 2025
  • MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance · Xing Hu, Zhixuan Chen, Dawei Yang, Zukang Xu, Chen Xu, Zhihang Yuan, Sifan Zhou, Jiangyong Yu · MoE, MQ · 02 May 2025
  • Position: Enough of Scaling LLMs! Lets Focus on Downscaling · Ayan Sengupta, Yash Goel, Tanmoy Chakraborty · 02 May 2025
  • Fast and Low-Cost Genomic Foundation Models via Outlier Removal · Haozheng Luo, Chenghao Qiu, Maojiang Su, Zhihan Zhou, Zoe Mehta, Guo Ye, Jerry Yao-Chieh Hu, Han Liu · AAML · 01 May 2025
  • Scaling On-Device GPU Inference for Large Generative Models · Jiuqiang Tang, Raman Sarokin, Ekaterina Ignasheva, Grant Jensen, Lin Chen, Juhyun Lee, Andrei Kulik, Matthias Grundmann · 01 May 2025
  • Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models · Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello · MQ · 30 Apr 2025
  • Softpick: No Attention Sink, No Massive Activations with Rectified Softmax · Zayd Muhammad Kawakibi Zuhri, Erland Hilman Fuadi, Alham Fikri Aji · 29 Apr 2025
  • TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate · A. Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni · MQ · 28 Apr 2025
  • R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference · Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zihan Wang, Steven Li · 28 Apr 2025
  • Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving · J. Zhang, Flood Sung, Zhiyong Yang, Yang Gao, Chongjie Zhang · LLMAG · 28 Apr 2025
  • PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation · Zihao An, Huajun Bai, Ziqiang Liu, Dong Li, E. Barsoum · 23 Apr 2025
  • StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models · Yeona Hong, Hyewon Han, Woo-Jin Chung, Hong-Goo Kang · MQ · 21 Apr 2025
  • Gradual Binary Search and Dimension Expansion: A general method for activation quantization in LLMs · Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello · MQ · 18 Apr 2025
  • 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float · Tianyi Zhang, Yang Sui, Shaochen Zhong, V. Chaudhary, Xia Hu, Anshumali Shrivastava · MQ · 15 Apr 2025
  • Understanding and Optimizing Multi-Stage AI Inference Pipelines · Abhimanyu Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Madhu Kumar, Tushar Krishna · 14 Apr 2025
  • Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization · Yamato Arai, Yuma Ichikawa · MQ · 13 Apr 2025
  • AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design · Yanbiao Liang, Huihong Shi, Haikuo Shao, Zhongfeng Wang · 07 Apr 2025
  • Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models · Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, C. Yuan, Lu Hou · MQ, LRM · 07 Apr 2025
  • Saliency-driven Dynamic Token Pruning for Large Language Models · Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang · 06 Apr 2025
  • Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression · Ivan Ilin, Peter Richtárik · 06 Apr 2025
  • HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse · Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang · 03 Apr 2025
  • MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators · Beichen Huang, Yueming Yuan, Zelei Shao, Minjia Zhang · MQ, MoE · 03 Apr 2025
  • GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration · Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda · MQ · 03 Apr 2025
  • LLMPi: Optimizing LLMs for High-Throughput on Raspberry Pi · Mahsa Ardakani, Jinendra Malekar, Ramtin Zand · MQ · 02 Apr 2025
  • SQuat: Subspace-orthogonal KV Cache Quantization · Hao Wang, Ligong Han, Kai Xu, Akash Srivastava · MQ · 31 Mar 2025
  • PIM-LLM: A High-Throughput Hybrid PIM Architecture for 1-bit LLMs · Jinendra Malekar, Peyton S. Chandarana, Md Hasibul Amin, Mohammed E. Elbtity, Ramtin Zand · 31 Mar 2025
  • Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference · Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang · 30 Mar 2025
  • Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models · Hung-Yueh Chiang, Chi-chih Chang, N. Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu · MQ · 28 Mar 2025
  • Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model · Abdelrahman M. Shaker, Muhammad Maaz, Chenhui Gou, Hamid Rezatofighi, Salman Khan, Fahad Shahbaz Khan · 27 Mar 2025