PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

4 June 2024
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Y. Li, K. Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao

Papers citing "PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling"

50 / 66 papers shown

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Y. Chen, J. Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
05 May 2025

FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension
Jushi Kai, Boyi Zeng, Y. Wang, Haoli Bai, Bo Jiang, Zhouhan Lin
01 May 2025

Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos, Róbert Csordás, Jürgen Schmidhuber
MoE, VLM
01 May 2025

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
A. Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
MQ
28 Apr 2025

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, E. Ponti
24 Apr 2025

AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, ..., Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, Bo Tang
14 Apr 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Yichao Yuan, Lin Ma, Nishil Talati
MoE
12 Apr 2025

Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models
Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong
KELM
11 Apr 2025

OmniSVG: A Unified Scalable Vector Graphics Generation Model
Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu Jiang
VLM
08 Apr 2025

Cognitive Memory in Large Language Models
Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
LLMAG, KELM
03 Apr 2025

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
VLM
01 Apr 2025

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri
01 Apr 2025

SQuat: Subspace-orthogonal KV Cache Quantization
Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
MQ
31 Mar 2025

AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
Kai Huang, Hao Zou, Bochen Wang, Ye Xi, Zhen Xie, Hao Wang
VLM
31 Mar 2025

EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen
28 Mar 2025

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo, Sibo Wei, C. Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song
VLM
23 Mar 2025

GPU-Accelerated Motion Planning of an Underactuated Forestry Crane in Cluttered Environments
M. Vu, Gerald Ebmer, Alexander Watcher, Marc-Philip Ecker, Giang Nguyen, Tobias Glueck
18 Mar 2025

CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
16 Mar 2025

ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
Xin Liu, Pei Liu, Guoming Tang
MoMe
13 Mar 2025

Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, H. Li
VLM
11 Mar 2025

Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
MQ, VLM
06 Mar 2025

Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs
Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant J. Nair, Poulami Das
02 Mar 2025

LightThinker: Thinking Step-by-Step Compression
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, H. Chen, N. Zhang
LRM, LLMAG
24 Feb 2025

DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
24 Feb 2025

KVCrush: Key value cache size-reduction using similarity in head-behaviour
Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain
MQ
24 Feb 2025

MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan, H. Shen, Xin Wang, C. Liu, Zheda Mai, M. Zhang
VLM
24 Feb 2025

CoKV: Optimizing KV Cache Allocation via Cooperative Game
Qiheng Sun, Hongwei Zhang, Haocheng Xia, Jiayao Zhang, Jinfei Liu, Kui Ren
VLM
21 Feb 2025

Compression Barriers for Autoregressive Transformers
Themistoklis Haris, Krzysztof Onak
21 Feb 2025

Neural Attention Search
Difan Deng, Marius Lindauer
21 Feb 2025

FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu
19 Feb 2025

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
17 Feb 2025

Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity
Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan
LRM
16 Feb 2025

CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
Zeliang Zhang, Yifan Zhu, Susan Liang, Zhiyuan Wang, Jiani Liu, ..., Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
VLM, MQ
15 Feb 2025

BalanceKV: KV Cache Compression through Discrepancy Theory
Insu Han, Michael Kapralov, Ekaterina Kochetkova, Kshiteej Sheth, A. Zandieh
11 Feb 2025

LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs
Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, Paul Hongsuck Seo
10 Feb 2025

Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective
Yuan Feng, Junlin Lv, Y. Cao, Xike Xie, S. Kevin Zhou
06 Feb 2025

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
C. Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
04 Feb 2025

Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
04 Feb 2025

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
Songhao Wu, Ang Lv, Xiao Feng, Y. Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan
MQ
01 Feb 2025

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. K. Zhou
VLM
28 Jan 2025

Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
WeiZhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han
22 Jan 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Junming Ma, Jin Tan, Runsheng Wang, Meng Li
12 Jan 2025

Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao
11 Jan 2025

AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu
04 Jan 2025

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., K. Zhang, C. L. P. Chen, Fan Yang, Y. Yang, Lili Qiu
03 Jan 2025

ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
M. Zhong, Xikai Liu, C. Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
12 Dec 2024

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu
MQ
08 Dec 2024

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang, Hui Chen, Jianchao Tan, K. Zhang, Xunliang Cai, Zijia Lin, J. Han, Guiguang Ding
VLM
04 Dec 2024

Unifying KV Cache Compression for Large Language Models with LeanKV
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C. S. Lui, Haibo Chen
MQ
04 Dec 2024

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang
27 Nov 2024