H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

24 June 2023
Zhenyu Zhang
Ying Sheng
Dinesh Manocha
Tianlong Chen
Lianmin Zheng
Ruisi Cai
Zhao Song
Yuandong Tian
Christopher Ré
Clark W. Barrett
Zhangyang Wang
Beidi Chen
    VLM
ArXiv (abs) · PDF · HTML · GitHub (447★)
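
For orientation before the citation list: the method named in the title keeps the KV cache small during decoding by retaining only a set of "heavy-hitter" tokens, those that accumulate the largest attention scores, alongside the most recent tokens, and evicting the rest. The sketch below is a minimal toy illustrating that selection rule only; it is not the authors' released implementation, and the function name, budget sizes, and random input are assumptions for illustration.

```python
# Minimal sketch of heavy-hitter KV cache selection in the spirit of H2O.
# Not the authors' code: `select_kv_indices`, the budgets, and the toy
# attention matrix are all illustrative assumptions.
import numpy as np

def select_kv_indices(attn, heavy_budget=4, recent_budget=4):
    """Pick KV cache slots to keep: the most recent tokens plus the
    'heavy hitters', i.e. tokens with the largest accumulated attention."""
    num_tokens = attn.shape[1]
    scores = attn.sum(axis=0)  # attention each token accumulated over steps
    recent = set(range(max(0, num_tokens - recent_budget), num_tokens))
    # Rank the remaining tokens by accumulated score and keep the top ones.
    candidates = [i for i in np.argsort(-scores) if i not in recent]
    heavy = set(candidates[:heavy_budget])
    return sorted(int(i) for i in recent | heavy)

# Toy example: 8 decoding steps attending over 8 cached tokens.
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows like softmax output
print(select_kv_indices(attn))           # indices of KV entries to retain
```

The paper applies this criterion greedily at every decoding step rather than in one shot as above; the point here is just the accumulated-attention ranking that defines a heavy hitter.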

Papers citing "H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models"

50 / 215 papers shown
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu
Jinyu Chen
Peirong Zheng
Xiaoquan Yi
Tianyi Tian
...
Quan Wan
Yining Qi
Yunfeng Fan
Qinliang Su
Xuemin Shen
AI4CE
183
2
0
18 Dec 2024
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong
Chengyao Wang
Yuqi Liu
Senqiao Yang
Longxiang Tang
...
Shaozuo Yu
Sitong Wu
Eric Lo
Shu Liu
Jiaya Jia
AuLLM
162
7
0
12 Dec 2024
XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
Weizhuo Li
Zhigang Wang
Yu Gu
Ge Yu
MQ
120
0
0
08 Dec 2024
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang
Hui Chen
Jianchao Tan
Kai Zhang
Xunliang Cai
Zijia Lin
Jiawei Han
Guiguang Ding
VLM
177
3
0
04 Dec 2024
Unifying KV Cache Compression for Large Language Models with LeanKV
Yanqi Zhang
Yuwei Hu
Runyuan Zhao
John C. S. Lui
Haibo Chen
MQ
288
7
0
04 Dec 2024
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Guangda Liu
Chong Li
Jieru Zhao
Chenqi Zhang
Minyi Guo
117
13
0
04 Dec 2024
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
Akshat Sharma
Hangliang Ding
Jianping Li
Neel Dani
Minjia Zhang
172
1
0
27 Nov 2024
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
Chao Fang
Man Shi
Robin Geens
Arne Symons
Zhongfeng Wang
Marian Verhelst
154
2
0
24 Nov 2024
Membership Inference Attack against Long-Context Large Language Models
Zixiong Wang
Gaoyang Liu
Yang Yang
Chen Wang
154
1
0
18 Nov 2024
Squeezed Attention: Accelerating Long Context Length LLM Inference
Coleman Hooper
Sehoon Kim
Hiva Mohammadzadeh
Monishwaran Maheswaran
June Paik
Michael W. Mahoney
Kemal Kurniawan
Amir Gholami
178
16
0
14 Nov 2024
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
Junqi Zhao
Zhijin Fang
Shu Li
Shaohui Yang
Shichao He
76
3
0
30 Oct 2024
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
Dezhan Tu
Danylo Vashchilenko
Yuzhe Lu
Panpan Xu
VLM
92
12
0
29 Oct 2024
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
Reuben Luera
Ryan Rossi
Alexa F. Siu
Franck Dernoncourt
Tong Yu
...
Hanieh Salehy
Jian Zhao
Samyadeep Basu
Puneet Mathur
Nedim Lipka
AI4TS
139
1
0
28 Oct 2024
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun
Li-Wen Chang
Yiyuan Ma
Wenlei Bao
Ningxin Zheng
Xin Liu
Harry Dong
Yuejie Chi
Beidi Chen
VLM
165
21
0
28 Oct 2024
RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
Tanqiu Jiang
Zian Wang
Jiacheng Liang
Changjiang Li
Yuhui Wang
Ting Wang
AAML
85
6
0
25 Oct 2024
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
Yu Fu
Zefan Cai
Abedelkadir Asi
Wayne Xiong
Yue Dong
Wen Xiao
101
30
0
25 Oct 2024
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
Ruisi Cai
Yeonju Ro
Geon-Woo Kim
Peihao Wang
Babak Ehteshami Bejnordi
Aditya Akella
Ziyi Wang
MoE
80
6
0
24 Oct 2024
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
Yifei Yang
Zouying Cao
Qiguang Chen
L. Qin
Dongjie Yang
Hai Zhao
Zhi Chen
66
6
0
24 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Zeang Sheng
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
135
46
0
22 Oct 2024
Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
Jie Peng
Zhang Cao
Huaizhi Qu
Zhengyu Zhang
Chang Guo
Yanyong Zhang
Zhichao Cao
Tianlong Chen
112
2
0
17 Oct 2024
An Evolved Universal Transformer Memory
Edoardo Cetin
Qi Sun
Tianyu Zhao
Yujin Tang
511
0
0
17 Oct 2024
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability
Zhongxiang Sun
Xiaoxue Zang
Kai Zheng
Yang Song
Jun Xu
Xiao Zhang
Weijie Yu
Yang Song
Han Li
128
17
0
15 Oct 2024
In-context KV-Cache Eviction for LLMs via Attention-Gate
Zihao Zeng
Bokai Lin
Tianqi Hou
Hao Zhang
Zhijie Deng
125
2
0
15 Oct 2024
HSR-Enhanced Sparse Attention Acceleration
Bo Chen
Yingyu Liang
Zhizhou Sha
Zhenmei Shi
Zhao Song
269
22
0
14 Oct 2024
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
Yefei He
Feng Chen
Jing Liu
Wenqi Shao
Hong Zhou
Kai Zhang
Bohan Zhuang
VLM
98
13
0
11 Oct 2024
TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text
Songshuo Lu
Hua Wang
Yutian Rong
Zhi Chen
Yaohua Tang
VLM
88
18
0
10 Oct 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
87
6
0
08 Oct 2024
Mixture Compressor for Mixture-of-Experts LLMs Gains More
Wei Huang
Yue Liao
Jianhui Liu
Ruifei He
Haoru Tan
Shiming Zhang
Hongsheng Li
Si Liu
Xiaojuan Qi
MoE
117
4
0
08 Oct 2024
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Lijie Yang
Zhihao Zhang
Zhuofu Chen
Zikun Li
Zhihao Jia
74
4
0
07 Oct 2024
Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
Jinhao Li
Jiaming Xu
Shan Huang
Yonghua Chen
Wen Li
...
Jiayi Pan
Li Ding
Hao Zhou
Yu Wang
Guohao Dai
201
21
0
06 Oct 2024
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
Rongzhi Zhang
Kuang Wang
Liyuan Liu
Shuohang Wang
Hao Cheng
Chao Zhang
Yelong Shen
MQ
92
12
0
04 Oct 2024
UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation
Zixuan Li
Jing Xiong
Fanghua Ye
Chuanyang Zheng
Xun Wu
...
Xiaodan Liang
Chengming Li
Zhenan Sun
Lingpeng Kong
Ngai Wong
RALM, UQLM
107
2
0
03 Oct 2024
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang
Binhang Yuan
Xu Han
Chaojun Xiao
Zhiyuan Liu
RALM
175
3
0
02 Oct 2024
A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts
Suyu Ge
Xihui Lin
Yunan Zhang
Jiawei Han
Hao Peng
117
4
0
02 Oct 2024
House of Cards: Massive Weights in LLMs
Jaehoon Oh
Seungjun Shin
Dokwan Oh
125
1
0
02 Oct 2024
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
Yi Xiong
Hao Wu
Changxu Shao
Ziqing Wang
Rui Zhang
Yuhong Guo
Junping Zhao
Ke Zhang
Zhenxuan Pan
80
6
0
01 Oct 2024
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
Zhenmei Shi
Yifei Ming
Xuan-Phi Nguyen
Yingyu Liang
Shafiq Joty
130
33
0
25 Sep 2024
Decoding Large-Language Models: A Systematic Overview of Socio-Technical Impacts, Constraints, and Emerging Questions
Zeyneb N. Kaya
Souvick Ghosh
55
0
0
25 Sep 2024
Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint
Sixiang Chen
Tian-Chun Ye
Peng Sun
Zhaohu Xing
Yunlong Lin
Lei Zhu
DiffM
81
12
0
24 Sep 2024
Inference-Friendly Models With MixAttention
Shashank Rajput
Ying Sheng
Sean Owen
Vitaliy Chiley
165
2
0
23 Sep 2024
PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
Zeyu Zhang
Haiying Shen
VLM
80
1
0
23 Sep 2024
EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models
Hossein Rajabzadeh
A. Jafari
Aman Sharma
Benyamin Jami
Hyock Ju Kwon
Ali Ghodsi
Boxing Chen
Mehdi Rezagholizadeh
63
0
0
22 Sep 2024
Towards LifeSpan Cognitive Systems
Yu Wang
Chi Han
Tongtong Wu
Xiaoxin He
Wangchunshu Zhou
...
Zexue He
Wei Wang
Gholamreza Haffari
Heng Ji
Julian McAuley
KELM, CLL
486
2
0
20 Sep 2024
CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
Junlin Lv
Yuan Feng
Xike Xie
Xin Jia
Qirong Peng
Guiming Xie
82
4
0
19 Sep 2024
Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview
Yanshu Wang
Tong Yang
Xiyan Liang
Guoan Wang
Hanning Lu
Xu Zhe
Yaoming Li
Li Weitao
MQ
94
3
0
18 Sep 2024
Do Large Language Models Need a Content Delivery Network?
Yihua Cheng
Kuntai Du
Jiayi Yao
Junchen Jiang
KELM
98
5
0
16 Sep 2024
CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
Luning Wang
Shiyao Li
Xuefei Ning
Zhihang Yuan
Shengen Yan
Guohao Dai
Yu Wang
76
0
0
16 Sep 2024
Schrodinger's Memory: Large Language Models
Wei Wang
Qing Li
83
1
0
16 Sep 2024
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan
Endian Li
Qiao Li
Shengwen Liang
Yizhou Shan
Ke Zhou
Yingwei Luo
Xiaolin Wang
Jie Zhang
86
12
0
08 Sep 2024
Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling
Kunal Jain
Anjaly Parayil
Ankur Mallick
Esha Choukse
Xiaoting Qin
...
Chetan Bansal
Victor Rühle
Anoop Kulkarni
Steve Kofsky
Saravan Rajmohan
93
2
0
24 Aug 2024