arXiv:2010.05680
TurboTransformers: An Efficient GPU Serving System For Transformer Models
Jiarui Fang, Yang Yu, Chen-liang Zhao, Jie Zhou
9 October 2020
Papers citing "TurboTransformers: An Efficient GPU Serving System For Transformer Models" (21 papers shown)
Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai
24 Apr 2025
Seesaw: High-throughput LLM Inference via Model Re-sharding [LRM]
Qidong Su, Wei Zhao, Xuelong Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
09 Mar 2025
iServe: An Intent-based Serving System for LLMs [VLM]
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
08 Jan 2025
Unifying KV Cache Compression for Large Language Models with LeanKV [MQ]
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C. S. Lui, Haibo Chen
04 Dec 2024
HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
Xuanlei Zhao, Bin Jia, Hao Zhou, Ziming Liu, Shenggan Cheng, Yang You
02 Mar 2024
ReLU^2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun
06 Feb 2024
A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
Harsh Sharma, Pratyush Dhingra, J. Doppa, Ümit Y. Ogras, P. Pande
18 Dec 2023
NNQS-Transformer: an Efficient and Scalable Neural Network Quantum States Approach for Ab initio Quantum Chemistry [GNN]
Yangjun Wu, Chu Guo, Yi Fan, P. Zhou, Honghui Shang
29 Jun 2023
S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput
Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei
09 Jun 2023
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You
22 May 2023
Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle [LRM]
Alex Kogan
12 Jan 2023
SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision [MQ]
Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kimmo Yan
19 Sep 2022
Boosting Distributed Training Performance of the Unpadded BERT Model
Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu, Dianhai Yu, Yanjun Ma
17 Aug 2022
A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining
Hongwu Peng, Shaoyi Huang, Shiyang Chen, Bingbing Li, Tong Geng, ..., Weiwen Jiang, Wujie Wen, J. Bi, Hang Liu, Caiwen Ding
07 Aug 2022
HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle
Guoxia Wang, Xiaomin Fang, Zhihua Wu, Yiqun Liu, Yang Xue, Yingfei Xiang, Dianhai Yu, Fan Wang, Yanjun Ma
12 Jul 2022
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, A. A. Awan, Cheng-rong Li, ..., Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He
30 Jun 2022
Answer Fast: Accelerating BERT on the Tensor Streaming Processor [LLMAG]
I. Ahmed, Sahil Parmar, Matthew Boyd, Michael Beidler, Kris Kang, Bill Liu, Kyle Roach, John Kim, D. Abts
22 Jun 2022
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours [AI4CE]
Shenggan Cheng, Xuanlei Zhao, Guangyang Lu, Bin-Rui Li, Zhongming Yu, Tian Zheng, R. Wu, Xiwen Zhang, Jian Peng, Yang You
02 Mar 2022
Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training [GNN]
Mark Zhao, Niket Agarwal, Aarti Basant, B. Gedik, Satadru Pan, ..., Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, Parikshit Pol
20 Aug 2021
PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management [VLM]
Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, Yang You
12 Aug 2021
Optimizing Inference Performance of Transformers on CPUs
D. Dice, Alex Kogan
12 Feb 2021