H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

24 June 2023
Zhenyu Zhang
Ying Sheng
Dinesh Manocha
Tianlong Chen
Lianmin Zheng
Ruisi Cai
Zhao Song
Yuandong Tian
Christopher Ré
Clark W. Barrett
Zhangyang Wang
Beidi Chen
    VLM
arXiv:2306.14048 (abs) · PDF · HTML · GitHub (447★)

Papers citing "H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models"

50 / 215 papers shown
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Adithya Bhaskar
Alexander Wettig
Tianyu Gao
Yihe Dong
Danqi Chen
22
0
0
20 Jun 2025
LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
Haoyue Zhang
Hualei Zhang
Xiaosong Ma
Jie Zhang
Song Guo
LRM
22
0
0
19 Jun 2025
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Minsoo Kim
Kyuhong Shim
Jungwook Choi
Simyung Chang
VLM
20
0
0
18 Jun 2025
Early Attentive Sparsification Accelerates Neural Speech Transcription
Zifei Xu
Sayeh Sharify
Hesham Mostafa
T. Webb
W. Yazar
Xin Wang
19
0
0
18 Jun 2025
Multipole Attention for Efficient Long Context Reasoning
Coleman Hooper
Sebastian Zhao
Luca Manolache
Sehoon Kim
Michael W. Mahoney
Y. Shao
Kurt Keutzer
Amir Gholami
OffRL, LRM
33
0
0
16 Jun 2025
Lag-Relative Sparse Attention In Long Context Training
Manlai Liang
Wanyi Huang
Mandi Liu
Huaijun Li
Jinlong Li
RALM
19
0
0
13 Jun 2025
Efficient Long-Context LLM Inference via KV Cache Clustering
Jie Hu
Shengnan Wang
Yutong He
Ping Gong
Jiawei Yi
...
Youhui Bai
Renhai Chen
Gong Zhang
Cheng-rong Li
Kun Yuan
29
0
0
13 Jun 2025
On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro
Zhenyu Zhang
Souvik Kundu
Zhangyang Wang
Aditya Akella
112
0
0
11 Jun 2025
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao
Shuming Guo
Shijie Cao
Yuqing Xia
Yu Cheng
...
Hayden Kwok-Hay So
Yu Hua
Ting Cao
Fan Yang
Mao Yang
VLM, LRM
41
0
0
10 Jun 2025
Spark Transformer: Reactivating Sparsity in FFN and Attention
Chong You
Kan Wu
Zhipeng Jia
Lin Chen
Srinadh Bhojanapalli
...
Felix X. Yu
Prateek Jain
David Culler
Henry M. Levy
Sanjiv Kumar
26
0
0
07 Jun 2025
Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan
Zhuoming Chen
Haizhong Zheng
Yang Zhou
Emma Strubell
Beidi Chen
116
0
0
05 Jun 2025
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Jiahui Wang
Z. Liu
Yongming Rao
Jiwen Lu
VLM, LRM
187
0
0
05 Jun 2025
Through the Stealth Lens: Rethinking Attacks and Defenses in RAG
Sarthak Choudhary
Nils Palumbo
Ashish Hooda
Krishnamurthy Dvijotham
Somesh Jha
47
0
0
04 Jun 2025
KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
Jiahao Wang
Jinbo Han
Xingda Wei
Sijie Shen
Dingyan Zhang
Chenguang Fang
Rong Chen
Wenyuan Yu
Haibo Chen
93
1
0
03 Jun 2025
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers
Woomin Song
Sai Muralidhar Jayanthi
S. Ronanki
Kanthashree Mysore Sathyendra
Jinwoo Shin
Aram Galstyan
Shubham Katiyar
S. Bodapati
VLM
56
0
0
01 Jun 2025
SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
Xiaodong Ji
Hailin Zhang
Fangcheng Fu
Bin Cui
33
0
0
30 May 2025
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
Xianglong Yan
Zhiteng Li
Tianao Zhang
Linghe Kong
Yulun Zhang
Xiaokang Yang
73
0
0
30 May 2025
Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding
Feiyu Yao
Qian Wang
34
0
0
30 May 2025
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim
Jinuk Kim
S. Kwon
Jae W. Lee
Sangdoo Yun
Hyun Oh Song
MQ, VLM
65
0
0
29 May 2025
Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
Donghyeon Joo
Helya Hosseini
Ramyad Hadidi
Bahar Asgari
74
0
0
28 May 2025
Curse of High Dimensionality Issue in Transformer for Long-context Modeling
Shuhai Zhang
Zeng You
Yaofo Chen
Z. Wen
Qianyue Wang
Zhijie Qiu
Yuanqing Li
Mingkui Tan
50
0
0
28 May 2025
EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
Tianyu Guo
Hande Dong
Yichong Leng
Feng Liu
Cheater Lin
Nong Xiao
X. Zhang
RALM
31
0
0
28 May 2025
Efficient Multi-modal Long Context Learning for Training-free Adaptation
Zehong Ma
Shiliang Zhang
Longhui Wei
Qi Tian
VLM
53
0
0
26 May 2025
FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
Dong Liu
Jiayi Zhang
Yifan Li
Yanxuan Yu
Ben Lengerich
Ying Nian Wu
73
1
0
26 May 2025
Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Kunjun Li
Zigeng Chen
Cheng-Yen Yang
Jenq-Neng Hwang
95
0
0
26 May 2025
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Benjamin Schneider
Dongfu Jiang
Chao Du
Tianyu Pang
Wenhu Chen
VLM
79
0
0
22 May 2025
MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention
Chaoyi Jiang
Sungwoo Kim
Lei Gao
Hossein Entezari Zarch
Won Woo Ro
Murali Annavaram
29
0
0
22 May 2025
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
Jinghui Lu
Haiyang Yu
Siliang Xu
Shiwei Ran
Guozhi Tang
...
Teng Fu
Hao Feng
Jingqun Tang
Hongru Wang
Can Huang
LRM
116
3
0
21 May 2025
Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Penghao Wu
Lewei Lu
Ziwei Liu
131
0
0
21 May 2025
UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache
Pu Wang
Pengwen Dai
Chen Wu
Yeying Jin
Dianjie Lu
Guijuan Zhang
Youshan Zhang
Zhuoran Zheng
66
1
0
20 May 2025
GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment
Jiwei Tang
Zhicheng Zhang
Shunlong Wu
Jingheng Ye
Lichen Bai
...
Tingwei Lu
Jiaqi Chen
Lin Hai
Hai-Tao Zheng
Hong-Gee Kim
61
0
0
18 May 2025
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
Jeffrey Willette
Heejun Lee
Sung Ju Hwang
77
0
0
16 May 2025
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training
Feijiang Han
Xiaodong Yu
Jianheng Tang
Lyle Ungar
104
0
0
16 May 2025
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
Zehao Fan
Garrett Gagnon
Zhenyu Liu
Liu Liu
67
0
0
09 May 2025
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
A. Zandieh
Majid Daliri
Majid Hadian
Vahab Mirrokni
MQ
132
0
0
28 Apr 2025
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Zhenyu Zhang
Zechun Liu
Yuandong Tian
Harshit Khaitan
Ziyi Wang
Steven Li
106
3
0
28 Apr 2025
An Empirical Study on Prompt Compression for Large Language Models
Zhenru Zhang
Jinyi Li
Yihuai Lan
Xinze Wang
Hao Wang
MQ
84
0
0
24 Apr 2025
Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention
Xiang Hu
Jiaqi Leng
Jun Zhao
Kewei Tu
Wei Wu
Mamba
110
0
0
23 Apr 2025
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Zihao An
Huajun Bai
Ziqiang Liu
Dong Li
E. Barsoum
184
0
0
23 Apr 2025
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
Yaxiong Wu
Sheng Liang
Chen Zhang
Yucheng Wang
Yanzhe Zhang
Huifeng Guo
Ruiming Tang
Yong Liu
KELM
143
7
0
22 Apr 2025
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park
Dalton Jones
Matthew J Morse
Raghavv Goel
Mingu Lee
Chris Lott
100
1
0
21 Apr 2025
Efficient Pretraining Length Scaling
Bohong Wu
Shen Yan
Sijun Zhang
Jianqiao Lu
Yutao Zeng
Ya Wang
Xun Zhou
477
0
0
21 Apr 2025
KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference
Yuxuan Tian
Zihan Wang
Yebo Peng
Aomufei Yuan
Zhaoxiang Wang
Bairen Yi
Xin Liu
Yong Cui
Tong Yang
75
0
0
14 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
109
0
0
14 Apr 2025
Adaptive Computation Pruning for the Forgetting Transformer
Zhixuan Lin
J. Obando-Ceron
Xu Owen He
Rameswar Panda
77
2
0
09 Apr 2025
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Gleb Rodionov
Roman Garipov
Alina Shutova
George Yakushev
Erik Schultheis
Vage Egiazarian
Anton Sinitsin
Denis Kuznedelev
Dan Alistarh
LRM
149
5
0
08 Apr 2025
LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Manlai Liang
JiaMing Zhang
Xiong Li
Jinlong Li
MQ
74
1
0
07 Apr 2025
Saliency-driven Dynamic Token Pruning for Large Language Models
Yao Tao
Yehui Tang
Yun Wang
Mingjian Zhu
Hailin Hu
Yunhe Wang
137
2
0
06 Apr 2025
Cognitive Memory in Large Language Models
Lianlei Shan
Shixian Luo
Zezhou Zhu
Yu Yuan
Yong Wu
LLMAG, KELM
525
3
0
03 Apr 2025
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
Yuxuan Zhu
Ali Falahati
David H. Yang
Mohammad Mohammadi Amiri
102
0
0
01 Apr 2025