CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and
Selective Sparsification

2 September 2024

Chun Jason Xue

Papers citing "CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification"

Title | Authors | Tags | Counts | Date
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao | | 172, 314, 0 | 19 Jan 2024
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han | EDL, MQ | 106, 578, 0 | 01 Jun 2023
Fast Inference from Transformers via Speculative Decoding | Yaniv Leviathan, Matan Kalman, Yossi Matias | LRM | 151, 736, 0 | 30 Nov 2022
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh | MQ | 152, 1,008, 0 | 31 Oct 2022
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning | Elias Frantar, Sidak Pal Singh, Dan Alistarh | MQ | 109, 243, 0 | 24 Aug 2022
PyTorch: An Imperative Style, High-Performance Deep Learning Library | Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala | ODL | 565, 42,639, 0 | 03 Dec 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu | AIMat | 503, 20,342, 0 | 23 Oct 2019
