RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval (arXiv:2409.10516)

3 January 2025
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kaipeng Zhang, Chong Chen, Fan Yang, Yuqing Yang, Lili Qiu

Papers citing "RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval"

18 papers shown
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Yushen Chen, J. Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
05 May 2025

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sébastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
Tags: KELM, MU
01 May 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
22 Apr 2025

AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, ..., Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, Bo Tang
14 Apr 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Yichao Yuan, Lin Ma, Nishil Talati
Tags: MoE
12 Apr 2025

PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models
Haofei Yin, Mengbai Xiao, Rouzhou Lu, Xiao Zhang, Dongxiao Yu, Guanghui Zhang
Tags: AI4CE
05 Apr 2025

PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference
Weisheng Jin, Maojia Song, Tej Deep Pala, Yew Ken Chia, Amir Zadeh, Chuan Li, Soujanya Poria
Tags: VLM
30 Mar 2025

AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation
Fengyu Li, Yilin Li, Junhao Zhu, Lu Chen, Yanfei Zhang, Jia Zhou, Hui Zu, Jingwen Zhao, Yunjun Gao
Tags: LLMAG
14 Mar 2025

LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
G. Wang, Shubhangi Upasani, Chen Henry Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker
11 Mar 2025

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng
Tags: CLL
01 Mar 2025

Machine learning and high dimensional vector search
Matthijs Douze
24 Feb 2025

Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference
Q. Xiao, Jiachuan Wang, Haoyang Li, Cheng Deng, Xiangbo Shu, Shuangyin Li, Yongqi Zhang, Jun Wang, Lei Chen
Tags: LLMSV
20 Feb 2025

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, M. Baghshah, Ehsaneddin Asgari
Tags: RALM
12 Feb 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kemal Kurniawan, Amir Gholami
Tags: MQ
05 Feb 2025

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
C. Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Enze Xie, Mingyu Gao
04 Feb 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Junming Ma, Jin Tan, Runsheng Wang, Meng Li
12 Jan 2025

Squeezed Attention: Accelerating Long Context Length LLM Inference
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kemal Kurniawan, Amir Gholami
14 Nov 2024

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, L. Qiu
02 Jul 2024