arXiv:2301.03598
Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
9 January 2023
Muhammad Osama, D. Merrill, C. Cecka, M. Garland, John Douglas Owens
Papers citing "Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU" (8 papers)
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim
MQ
20 Jan 2025
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
02 Jan 2025
GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu
MQ
23 Dec 2024
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan
17 May 2024
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
Ruihang Lai, Junru Shao, Siyuan Feng, Steven Lyubomirsky, Bohan Hou, ..., Sunghyun Park, Prakalp Srivastava, Jared Roesch, T. Mowry, Tianqi Chen
01 Nov 2023
MLPerf Training Benchmark
Arya D. McCarthy, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, ..., Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, C. Young, Matei A. Zaharia
02 Oct 2019
Input-Aware Auto-Tuning of Compute-Bound HPC Kernels
Philippe Tillet, David D. Cox
15 Feb 2018
cuDNN: Efficient Primitives for Deep Learning
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan M. Cohen, J. Tran, Bryan Catanzaro, Evan Shelhamer
03 Oct 2014