
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

19 September 2023
Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song
ArXiv · PDF · HTML

Papers citing "Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity"

7 / 7 papers shown
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu
21 Feb 2025

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023

Efficient Quantized Sparse Matrix Operations on Tensor Cores
Shigang Li, Kazuki Osawa, Torsten Hoefler
14 Sep 2022

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning
Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, Luis Ceze
11 Jul 2022

Can Foundation Models Wrangle Your Data?
A. Narayan, Ines Chami, Laurel J. Orr, Simran Arora, Christopher Ré
Communities: LMTD, AI4CE
20 May 2022

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste
Communities: MQ
31 Jan 2021

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
Communities: MoE
17 Sep 2019