H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
arXiv:2306.14048 (v3, latest)

24 June 2023
Zhenyu Zhang
Ying Sheng
Dinesh Manocha
Tianlong Chen
Lianmin Zheng
Ruisi Cai
Zhao Song
Yuandong Tian
Christopher Ré
Clark W. Barrett
Zhangyang Wang
Beidi Chen

Papers citing "H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models"

15 / 215 papers shown
BiSHop: Bi-Directional Cellular Learning for Tabular Data with Generalized Sparse Modern Hopfield Model
Chenwei Xu, Yu-Chao Huang, Jerry Yao-Chieh Hu, Weijian Li, Ammar Gilani, H. Goan, Han Liu
04 Apr 2024

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Youpeng Zhao, Di Wu, Jun Wang
26 Mar 2024

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang
21 Mar 2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti
14 Mar 2024

LLM Inference Unveiled: Survey and Roofline Model Insights
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, ..., Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
26 Feb 2024

SubGen: Token Generation in Sublinear Time and Memory
A. Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
08 Feb 2024

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun
07 Feb 2024

Training and Serving System of Foundation Models: A Comprehensive Survey
Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuan-fu Zhang, Zibin Zheng
05 Jan 2024

Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy
28 Oct 2023

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Josh Alman, Zhao Song
06 Oct 2023

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
03 Oct 2023

GradientCoin: A Peer-to-Peer Decentralized Large Language Models
Yeqi Gao, Zhao Song, Junze Yin
21 Aug 2023

Zero-th Order Algorithm for Softmax Attention Optimization
Yichuan Deng, Zhihang Li, Sridhar Mahadevan, Zhao Song
17 Jul 2023

Fast Quantum Algorithm for Attention Computation
Yeqi Gao, Zhao Song, Xin Yang, Ruizhe Zhang
16 Jul 2023

Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
06 Nov 2019