ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (arXiv:2403.17312)

26 March 2024 · Youpeng Zhao, Di Wu, Jun Wang

Papers citing "ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching"

9 / 9 papers shown

Cognitive Memory in Large Language Models
Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
LLMAG, KELM · 2 citations · 03 Apr 2025

Mitigating KV Cache Competition to Enhance User Experience in LLM Inference
Haiying Shen, Tanmoy Sen, Masahiro Tanaka
0 citations · 17 Mar 2025

The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems
Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou
8 citations · 30 Sep 2024

H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang, Ying Sheng, Dinesh Manocha, Tianlong Chen, Lianmin Zheng, ..., Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, Beidi Chen
VLM · 275 citations · 24 Jun 2023

ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention
Jyotikrishna Dass, Shang Wu, Huihong Shi, Chaojian Li, Zhifan Ye, Zhongfeng Wang, Yingyan Lin
54 citations · 09 Nov 2022

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Hanrui Wang, Zhekai Zhang, Song Han
384 citations · 17 Dec 2020

Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
1,678 citations · 08 Jun 2020

Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, Arman Cohan
RALM, VLM · 3,996 citations · 10 Apr 2020

Robust Quantization: One Model to Rule Them All
Moran Shkolnik, Brian Chmiel, Ron Banner, Gil Shomron, Yury Nahshan, A. Bronstein, U. Weiser
OOD, MQ · 75 citations · 18 Feb 2020