Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

22 July 2024
Georgy Tyukin, G. Dovonon, Jean Kaddour, Pasquale Minervini
arXiv: 2407.15516

Papers citing "Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models"

7 papers shown:
  1. Layer-Condensed KV Cache for Efficient Inference of Large Language Models
     Haoyi Wu, Kewei Tu (17 May 2024)

  2. Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
     Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti (14 Mar 2024)

  3. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
     Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Chak Tou Leong, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui (15 Jan 2024)

  4. RWKV: Reinventing RNNs for the Transformer Era
     Bo Peng, Eric Alcaide, Quentin G. Anthony, Alon Albalak, Samuel Arcadinho, ..., Qihang Zhao, P. Zhou, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu (22 May 2023)

  5. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
     Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas (05 Mar 2021)

  6. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
     Minjia Zhang, Yuxiong He (26 Oct 2020)

  7. Reducing Transformer Depth on Demand with Structured Dropout
     Angela Fan, Edouard Grave, Armand Joulin (25 Sep 2019)