Cognitive Memory in Large Language Models
arXiv:2504.02441 · 3 April 2025
Authors: Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
Communities: LLMAG, KELM
Papers citing "Cognitive Memory in Large Language Models" (50 of 72 papers shown)
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, ..., Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-hai Xu, Feiyu Xiong · RALM · 28 May 2025

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang · 08 Sep 2024

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen · LRM · 20 Aug 2024

Cross-layer Attention Sharing for Large Language Models
Yongyu Mu, Yuzhang Wu, Yuchun Fan, Chenglong Wang, Hengyu Li, Qiaozhi He, Murun Yang, Tong Xiao, Jingbo Zhu · 04 Aug 2024

Palu: Compressing KV-Cache with Low-Rank Projection
Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, N. Huang, Luis Ceze, Kai-Chiang Wu · 30 Jul 2024

A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
Hyun Rae Jo, Dong Kun Shin · 30 Jul 2024
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo · 30 Jul 2024

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang · MQ · 22 Jul 2024

Beyond KV Caching: Shared Attention for Efficient LLMs
Bingli Liao, Danilo Vasconcellos Vargas · 13 Jul 2024

Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang · MoMe · 11 Jul 2024

Efficient Sparse Attention needs Adaptive Token Release
Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li · 02 Jul 2024
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, ..., Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan · LLMAG · 25 Jun 2024

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, ..., Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang · MQ · 21 Jun 2024

Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe · 18 Jun 2024

D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, ..., Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang · 18 Jun 2024

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling
Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung · 17 Jun 2024
Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens
Weiyao Luo, Suncong Zheng, Heming Xia, Weikang Wang, Yan Lei, Tianyu Liu, Shuang Chen, Zhifang Sui · 16 Jun 2024

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han · 16 Jun 2024

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji · 13 Jun 2024

Effectively Compress KV Heads for LLM
Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu · MQ, VLM · 11 Jun 2024

Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation
A. Aadharsh Aadhithya, S. Sachin Kumar, Soman K. P · RALM · 10 Jun 2024
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yongqian Li, ..., Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao · 04 Jun 2024

Toward Conversational Agents with Context and Time Sensitive Long-term Memory
Nick Alonso, Tomás Figliolia, A. Ndirango, Beren Millidge · RALM, 3DV · 29 May 2024

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang · 26 May 2024

SirLLM: Streaming Infinite Retentive LLM
Yao Yao, Z. Li, Hai Zhao · KELM, RALM · 21 May 2024

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar · VLM · 07 May 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, ..., Zhuoshu Li, Zihan Wang, Zihui Gu, Zilin Li, Ziwei Xie · MoE · 07 May 2024

Efficient LLM Inference with Kcache
Qiaozhi He, Zhihua Wu · RALM · 28 Apr 2024

SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen · VLM · 22 Apr 2024

TransformerFAM: Feedback attention is working memory
Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, K. Sim, P. M. Mengibar · 14 Apr 2024

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Zihao Wang, Shaoduo Gan · 07 Apr 2024

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Youpeng Zhao, Di Wu, Jun Wang · 26 Mar 2024
FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
Jiaao He, Jidong Zhai · 18 Mar 2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti · 14 Mar 2024

Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath · 14 Mar 2024

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
Sainbayar Sukhbaatar, O. Yu. Golovneva, Vasu Sharma, Hu Xu, Xi Lin, ..., Jacob Kahn, Shang-Wen Li, Wen-tau Yih, Jason Weston, Xian Li · MoMe, OffRL, MoE · 12 Mar 2024

Training-Free Long-Context Scaling of Large Language Models
Chen An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong · ALM, LRM · 27 Feb 2024
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Lu Ye, Ze Tao, Yong Huang, Yang Li · 23 Feb 2024

LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang · LLMAG · 16 Feb 2024

A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John F. Canny, Ian S. Fischer · RALM · 15 Feb 2024

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen · 14 Feb 2024

Commonsense-augmented Memory Construction and Management in Long-term Conversations via Context-aware Persona Refinement
Hana Kim, Kai Tzu-iunn Ong, Seoyeon Kim, Dongha Lee, Jinyoung Yeo · 25 Jan 2024
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, ..., Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin · 05 Jan 2024

Empowering Working Memory for Large Language Model Agents
Jing Guo, Nan Li, J. Qi, Hang Yang, Ruiqiao Li, Yuzhen Feng, Si Zhang, Ming Xu · LLMAG · 22 Dec 2023

Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu · 14 Dec 2023

Compressed Context Memory For Online Language Model Interaction
Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song · KELM · 06 Dec 2023

Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory
Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Qing Cui, Jinjie Gu, Guannan Zhang · LRM, LLMAG, KELM · 15 Nov 2023
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis · AI4TS, RALM · 29 Sep 2023

Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica · VLM · 12 Sep 2023

Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models
Qingyue Wang, Y. Fu, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao · LLMAG, KELM, RALM · 29 Aug 2023