Prompt Cache: Modular Attention Reuse for Low-Latency Inference
arXiv: 2311.04934
7 November 2023
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Papers citing "Prompt Cache: Modular Attention Reuse for Low-Latency Inference" (5 of 55 shown)
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, ..., Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang
11 Oct 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023
Latency Adjustable Transformer Encoder for Language Understanding
Sajjad Kachuee, M. Sharifkhani
10 Jan 2022
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, M. Lewis
27 Aug 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019