arXiv:1612.03079
Clipper: A Low-Latency Online Prediction Serving System
9 December 2016
D. Crankshaw
Xin Wang
Giulio Zhou
Michael Franklin
Joseph E. Gonzalez
Ion Stoica
Papers citing
"Clipper: A Low-Latency Online Prediction Serving System"
50 / 71 papers shown
Title
ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
Seungbeom Choi
Jeonghoe Goo
Eunjoo Jeon
Mingyu Yang
Minsung Jang
21
0
0
14 May 2025
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Yong Li
Haojie Wang
Biao Hou
Jidong Zhai
40
0
0
12 May 2025
Patchwork: A Unified Framework for RAG Serving
Bodun Hu
Luis Pabon
Saurabh Agarwal
Aditya Akella
26
0
0
01 May 2025
LithOS: An Operating System for Efficient Machine Learning on GPUs
Patrick H. Coppock
Brian Zhang
Eliot H. Solomon
Vasilis Kypriotis
Leon Yang
Bikash Sharma
Dan Schatzberg
Todd C. Mowry
Dimitrios Skarlatos
32
0
0
21 Apr 2025
Cyber for AI at SemEval-2025 Task 4: Forgotten but Not Lost: The Balancing Act of Selective Unlearning in Large Language Models
Dinesh Srivasthav P
Bala Mallikarjunarao Garlapati
MU
47
0
0
02 Mar 2025
iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos
Tianrui Hu
Prasoon Sinha
N. Yadwadkar
VLM
205
0
0
08 Jan 2025
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
Ferdi Kossmann
Bruce Fontaine
Daya Khudia
Michael Cafarella
Samuel Madden
116
2
0
23 Oct 2024
Erasure Coded Neural Network Inference via Fisher Averaging
Divyansh Jhunjhunwala
Neharika Jali
Gauri Joshi
Shiqiang Wang
MoMe
FedML
31
1
0
02 Sep 2024
Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling
Sohaib Ahmad
Hui Guan
Ramesh K. Sitaraman
42
4
0
04 Jul 2024
Teola: Towards End-to-End Optimization of LLM-based Applications
Xin Tan
Yimin Jiang
Yitao Yang
Hong-Yu Xu
73
5
0
29 Jun 2024
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
Feng Liang
Zhen Zhang
Haifeng Lu
Victor C. M. Leung
Yanyi Guo
Xiping Hu
GNN
37
6
0
09 Apr 2024
Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
Kamran Razavi
Saeid Ghafouri
Max Mühlhäuser
Pooyan Jamshidi
Lin Wang
29
3
0
31 Mar 2024
Genie: Smart ROS-based Caching for Connected Autonomous Robots
Zexin Li
Soroush Bateni
Cong Liu
37
1
0
29 Feb 2024
FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
Xupeng Miao
Gabriele Oliaro
Xinhao Cheng
Vineeth Kada
Ruohan Gao
...
April Yang
Yingcheng Wang
Mengdi Wu
Colin Unger
Zhihao Jia
MoE
94
9
0
29 Feb 2024
Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
Yuting Yang
Andrea Merlina
Weijia Song
Tiancheng Yuan
Ken Birman
Roman Vitenberg
49
0
0
27 Feb 2024
Graft: Efficient Inference Serving for Hybrid Deep Learning with SLO Guarantees via DNN Re-alignment
Jing Wu
Lin Wang
Qirui Jin
Fangming Liu
33
11
0
17 Dec 2023
Synergy: Towards On-Body AI via Tiny AI Accelerator Collaboration on Wearables
Taesik Gong
S. Jang
Utku Günay Acer
F. Kawsar
Chulhong Min
41
2
0
11 Dec 2023
Punica: Multi-Tenant LoRA Serving
Lequn Chen
Zihao Ye
Yongji Wu
Danyang Zhuo
Luis Ceze
Arvind Krishnamurthy
44
34
0
28 Oct 2023
Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems
Debopam Sanyal
Jui-Tse Hung
Manavi Agrawal
Prahlad Jasti
Shahab Nikkhoo
S. Jha
Tianhao Wang
Sibin Mohan
Alexey Tumanov
45
0
0
03 Jul 2023
S³: Increasing GPU Utilization during Generative Inference for Higher Throughput
Yunho Jin
Chun-Feng Wu
David Brooks
Gu-Yeon Wei
29
62
0
09 Jun 2023
FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
Minchen Yu
Ao Wang
Dong-dong Chen
Haoxuan Yu
Xiaonan Luo
Zhuohao Li
Wei Wang
Ruichuan Chen
Dapeng Nie
Haoran Yang
13
12
0
06 Jun 2023
Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service
Baolin Li
S. Samsi
V. Gadepally
Devesh Tiwari
28
27
0
19 Apr 2023
MadEye: Boosting Live Video Analytics Accuracy with Adaptive Camera Configurations
M. Wong
M. Ramanujam
Guha Balakrishnan
Ravi Netravali
34
4
0
04 Apr 2023
MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
Yihao Zhao
Xin Liu
Shufan Liu
Xiang Li
Yibo Zhu
Gang Huang
Xuanzhe Liu
Xin Jin
32
11
0
24 Mar 2023
Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning
Gabriele Castellano
J. Nieto
Jordi Luque
Ferran Diego
Carlos Segura
Diego Perino
Flavio Esposito
Fulvio Risso
Aravindh Raman
11
0
0
31 Jan 2023
Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle
Alex Kogan
LRM
21
0
0
12 Jan 2023
Kernel-as-a-Service: A Serverless Interface to GPUs
Nathan Pemberton
Anton Zabreyko
Zhoujie Ding
R. Katz
Joseph E. Gonzalez
29
8
0
15 Dec 2022
iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud
Fei Xu
Jianian Xu
Jiabin Chen
Li Chen
Ruitao Shang
Zhi Zhou
Fengyuan Liu
GNN
32
35
0
03 Nov 2022
Management of Machine Learning Lifecycle Artifacts: A Survey
Marius Schlegel
K. Sattler
25
35
0
21 Oct 2022
Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference
Zehuan Wang
Yingcan Wei
Minseok Lee
Matthias Langer
F. Yu
...
Daniel G. Abel
Xu Guo
Jianbing Dong
Ji Shi
Kunlun Li
GNN
LRM
25
32
0
17 Oct 2022
KAIROS: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
Baolin Li
S. Samsi
V. Gadepally
Devesh Tiwari
24
11
0
12 Oct 2022
Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs
Alexandros Kouris
Stylianos I. Venieris
Stefanos Laskaridis
Nicholas D. Lane
42
8
0
27 Sep 2022
Improving the Performance of DNN-based Software Services using Automated Layer Caching
M. Abedi
Yanni Iouannou
Pooyan Jamshidi
Hadi Hemmati
25
0
0
18 Sep 2022
Operationalizing Machine Learning: An Interview Study
Shreya Shankar
Rolando Garcia
J. M. Hellerstein
Aditya G. Parameswaran
71
51
0
16 Sep 2022
An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
Pierrick Pochelu
S. Petiton
B. Conche
14
2
0
30 Aug 2022
RIBBON: Cost-Effective and QoS-Aware Deep Learning Model Inference using a Diverse Pool of Cloud Computing Instances
Baolin Li
Rohan Basu Roy
Tirthak Patel
V. Gadepally
K. Gettings
Devesh Tiwari
32
25
0
23 Jul 2022
On Efficient Approximate Queries over Machine Learning Models
Dujian Ding
S. Amer-Yahia
L. Lakshmanan
19
5
0
06 Jun 2022
Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
Yongji Wu
Matthew Lentz
Danyang Zhuo
Yao Lu
29
22
0
10 May 2022
Pathways: Asynchronous Distributed Dataflow for ML
P. Barham
Aakanksha Chowdhery
J. Dean
Sanjay Ghemawat
Steven Hand
...
Parker Schuh
Ryan Sepassi
Laurent El Shafey
C. A. Thekkath
Yonghui Wu
GNN
MoE
45
126
0
23 Mar 2022
GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge
Arthi Padmanabhan
Neil Agarwal
Anand Iyer
Ganesh Ananthanarayanan
Yuanchao Shu
Nikolaos Karianakis
G. Xu
Ravi Netravali
43
59
0
19 Jan 2022
Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
Cheng Tan
Zhichao Li
Jian Zhang
Yunyin Cao
Sikai Qi
Zherui Liu
Yibo Zhu
Chuanxiong Guo
21
34
0
18 Sep 2021
SensiX++: Bringing MLOPs and Multi-tenant Model Serving to Sensory Edge Devices
Chulhong Min
Akhil Mathur
Utku Günay Acer
A. Montanari
F. Kawsar
25
11
0
08 Sep 2021
Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning
S. Choi
Sunho Lee
Yeonjae Kim
Jongse Park
Youngjin Kwon
Jaehyuk Huh
30
21
0
01 Sep 2021
Computation and Communication Co-Design for Real-Time Monitoring and Control in Multi-Agent Systems
Vishrant Tripathi
Luca Ballotta
Luca Carlone
E. Modiano
16
10
0
06 Aug 2021
Concept for a Technical Infrastructure for Management of Predictive Models in Industrial Applications
F. Bachinger
G. Kronberger
14
5
0
29 Jul 2021
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
54
94
0
01 Jul 2021
ModelPS: An Interactive and Collaborative Platform for Editing Pre-trained Models at Scale
Yuanming Li
Huaizheng Zhang
Shanshan Jiang
Fan Yang
Yonggang Wen
Yong Luo
21
2
0
18 May 2021
DeepRT: A Soft Real Time Scheduler for Computer Vision Applications on the Edge
Zhe Yang
K. Nahrstedt
Hongpeng Guo
Qian Zhou
11
21
0
05 May 2021
Accelerating Deep Learning Inference via Learned Caches
Arjun Balasubramanian
Adarsh Kumar
Yuhan Liu
Han Cao
Shivaram Venkataraman
Aditya Akella
28
18
0
18 Jan 2021
PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment
Meghana Madhyastha
Kunal Lillaney
J. Browne
Joshua T. Vogelstein
Randal C. Burns
25
1
0
10 Nov 2020