2405.05254
Cited By
v1
v2 (latest)
You Only Cache Once: Decoder-Decoder Architectures for Language Models
8 May 2024
Yutao Sun
Li Dong
Yi Zhu
Shaohan Huang
Wenhui Wang
Shuming Ma
Quanlu Zhang
Jianyong Wang
Furu Wei
VLM
Papers citing "You Only Cache Once: Decoder-Decoder Architectures for Language Models"
23 / 23 papers shown
Title
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
Yixuan Wang
Shiyu Ji
Yijun Liu
Yuzhuang Xu
Yang Xu
Qingfu Zhu
Wanxiang Che
46
0
0
24 May 2025
ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training
Maryam Dialameh
Rezaul Karim
Hossein Rajabzadeh
Omar Mohamed Awad
Hyock Ju Kwon
Boxing Chen
Walid Ahmed
Yang Liu
76
0
0
22 May 2025
Chain-of-Model Learning for Language Model
Kaitao Song
Xiaohua Wang
Xu Tan
Huiqiang Jiang
Chengruidong Zhang
...
Xiaoqing Zheng
Tao Qin
Yuqing Yang
Dongsheng Li
Lili Qiu
LRM
AI4CE
166
1
0
17 May 2025
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo
Sibo Wei
C. Zhang
Zhuorui Liu
Wenpeng Lu
Dawei Song
VLM
108
0
0
23 Mar 2025
GPU-Accelerated Motion Planning of an Underactuated Forestry Crane in Cluttered Environments
M. Vu
Gerald Ebmer
Alexander Watcher
Marc-Philip Ecker
Giang Nguyen
Tobias Glueck
125
3
0
18 Mar 2025
Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques
Neusha Javidnia
B. Rouhani
F. Koushanfar
538
0
0
14 Mar 2025
Liger: Linearizing Large Language Models to Gated Recurrent Structures
Disen Lan
Weigao Sun
Jiaxi Hu
Jusen Du
Yu Cheng
133
1
0
03 Mar 2025
Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Xiang Liu
Zhenheng Tang
Hong Chen
Peijie Dong
Zeyu Li
Xiuze Zhou
Bo Li
Xuming Hu
Xiaowen Chu
457
7
0
04 Feb 2025
Parallel Key-Value Cache Fusion for Position Invariant RAG
Philhoon Oh
Jinwoo Shin
James Thorne
3DV
482
0
0
13 Jan 2025
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Sangmin Bae
Adam Fisch
Hrayr Harutyunyan
Ziwei Ji
Seungyeon Kim
Tal Schuster
KELM
129
7
0
28 Oct 2024
A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
You Wu
Haoyi Wu
Kewei Tu
70
3
0
18 Oct 2024
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Bokai Lin
Zihao Zeng
Zipeng Xiao
Siqi Kou
Tianqi Hou
Xiaofeng Gao
Hao Zhang
Zhijie Deng
73
6
0
16 Oct 2024
How to Train Long-Context Language Models (Effectively)
Tianyu Gao
Alexander Wettig
Howard Yen
Danqi Chen
RALM
166
48
0
03 Oct 2024
Gated Linear Attention Transformers with Hardware-Efficient Training
Aaron Courville
Bailin Wang
Songlin Yang
Yikang Shen
Yoon Kim
116
180
0
11 Dec 2023
YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng
Jeffrey Quesnelle
Honglu Fan
Enrico Shippole
OSLM
81
264
0
31 Aug 2023
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
362
1,094
0
05 Oct 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
148
644
0
22 Aug 2022
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM
MLLM
MoE
102
559
0
03 Nov 2021
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su
Yu Lu
Shengfeng Pan
Ahmed Murtadha
Bo Wen
Yunfeng Liu
327
2,533
0
20 Apr 2021
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
638
4,921
0
23 Jan 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
488
20,342
0
23 Oct 2019
Generating Long Sequences with Sparse Transformers
R. Child
Scott Gray
Alec Radford
Ilya Sutskever
135
1,916
0
23 Apr 2019
Group Normalization
Yuxin Wu
Kaiming He
245
3,672
0
22 Mar 2018