2504.12984
Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
17 April 2025
Yaoyao Ding
Bohan Hou
Xinyu Zhang
Allan Lin
Tianqi Chen
Cody Yu Hao
Yida Wang
Gennady Pekhimenko
Papers citing
"Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving"
29 / 29 papers shown
Title
SpinQuant: LLM quantization with learned rotations
Zechun Liu
Changsheng Zhao
Igor Fedorov
Bilge Soran
Dhruv Choudhary
Raghuraman Krishnamoorthi
Vikas Chandra
Yuandong Tian
Tijmen Blankevoort
MQ
234
124
0
21 Feb 2025
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
135
908
0
31 Jul 2024
Qwen2 Technical Report
An Yang
Baosong Yang
Binyuan Hui
Jian Xu
Bowen Yu
...
Yuqiong Liu
Zeyu Cui
Zhenru Zhang
Zhifang Guo
Zhi-Wei Fan
OSLM
VLM
MU
167
973
0
15 Jul 2024
eXmY: A Data Type and Technique for Arbitrary Bit Precision Quantization
Aditya Agrawal
Matthew Hedlund
Blake A. Hechtman
MQ
77
4
0
22 May 2024
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Saleh Ashkboos
Amirkeivan Mohtashami
Maximilian L. Croci
Bo Li
Martin Jaggi
Dan Alistarh
Torsten Hoefler
James Hensman
MQ
110
181
0
30 Mar 2024
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Albert Tseng
Jerry Chee
Qingyao Sun
Volodymyr Kuleshov
Christopher De Sa
MQ
194
128
0
06 Feb 2024
Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition
Adnan Hoque
Less Wright
Chih-Chieh Yang
Mudhakar Srivatsa
R. Ganti
18
1
0
05 Jan 2024
Microscaling Data Formats for Deep Learning
B. Rouhani
Ritchie Zhao
Ankit More
Mathew Hall
Alireza Khodamoradi
...
Maxim Naumov
Colin Verilli
Ralph Wittig
Doug Burger
Eric S. Chung
MQ
96
63
0
16 Oct 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon
Zhuohan Li
Siyuan Zhuang
Ying Sheng
Lianmin Zheng
Cody Hao Yu
Joseph E. Gonzalez
Haotong Zhang
Ion Stoica
VLM
192
2,311
0
12 Sep 2023
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Jerry Chee
Yaohui Cai
Volodymyr Kuleshov
Chris De Sa
MQ
91
209
0
25 Jul 2023
Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU
Muhammad Osama
D. Merrill
C. Cecka
M. Garland
John Douglas Owens
57
28
0
09 Jan 2023
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
401
2,394
0
09 Nov 2022
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar
Saleh Ashkboos
Torsten Hoefler
Dan Alistarh
MQ
140
1,005
0
31 Oct 2022
ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs
Guyue Huang
Yang Bai
Liu Liu
Yuke Wang
Bei Yu
Yufei Ding
Yuan Xie
76
17
0
29 Oct 2022
Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
Yaoyao Ding
Cody Hao Yu
Bojian Zheng
Yizhi Liu
Yida Wang
Gennady Pekhimenko
56
32
0
18 Oct 2022
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers
M. Lewis
Younes Belkada
Luke Zettlemoyer
MQ
103
662
0
15 Aug 2022
TensorIR: An Abstraction for Automatic Tensorized Program Optimization
Siyuan Feng
Bohan Hou
Hongyi Jin
Wuwei Lin
Junru Shao
...
Zihao Ye
Lianmin Zheng
Cody Hao Yu
Yong Yu
Tianqi Chen
51
68
0
09 Jul 2022
Tensor Program Optimization with Probabilistic Programs
Junru Shao
Xiyou Zhou
Siyuan Feng
Bohan Hou
Ruihang Lai
Hongyi Jin
Wuwei Lin
Masahiro Masuda
Cody Hao Yu
Tianqi Chen
76
31
0
26 May 2022
Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance
Jiarong Xing
Leyuan Wang
Shang Zhang
Jack H Chen
Ang Chen
Yibo Zhu
58
44
0
25 Oct 2021
The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding
Pratik Fegade
Tianqi Chen
Phillip B. Gibbons
T. Mowry
49
29
0
19 Oct 2021
DISC: A Dynamic Shape Compiler for Machine Learning Workloads
Kai Zhu
Wenyi Zhao
Zhen Zheng
Tianyou Guo
Pengzhan Zhao
...
Junjie Bai
Jun Yang
Xiaoyong Liu
Lansong Diao
Wei Lin
55
27
0
09 Mar 2021
Cortex: A Compiler for Recursive Deep Learning Models
Pratik Fegade
Tianqi Chen
Phillip B. Gibbons
T. Mowry
VLM
39
28
0
02 Nov 2020
IOS: Inter-Operator Scheduler for CNN Acceleration
Yaoyao Ding
Ligeng Zhu
Zhihao Jia
Gennady Pekhimenko
Song Han
61
75
0
02 Nov 2020
Ansor: Generating High-Performance Tensor Programs for Deep Learning
Lianmin Zheng
Chengfan Jia
Minmin Sun
Zhao Wu
Cody Hao Yu
...
Jun Yang
Danyang Zhuo
Koushik Sen
Joseph E. Gonzalez
Ion Stoica
142
402
0
11 Jun 2020
Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference
Haichen Shen
Jared Roesch
Zhi Chen
Wei Chen
Yong Wu
Mu Li
Vin Sharma
Zachary Tatlock
Yida Wang
46
57
0
04 Jun 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
874
42,379
0
28 May 2020
Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation
Byung Hoon Ahn
Prannoy Pilligundla
Amir Yazdanbakhsh
H. Esmaeilzadeh
ODL
98
82
0
23 Jan 2020
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
Nicolas Vasilache
O. Zinenko
Theodoros Theodoridis
Priya Goyal
Zach DeVito
William S. Moses
Sven Verdoolaege
Andrew Adams
Albert Cohen
76
436
0
13 Feb 2018
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
783
132,363
0
12 Jun 2017