arXiv: 2007.03356
Do Transformers Need Deep Long-Range Memory?
7 July 2020
Jack W. Rae, Ali Razavi
RALM
Papers citing "Do Transformers Need Deep Long-Range Memory?"
13 / 13 papers shown
Are queries and keys always relevant? A case study on Transformer wave functions
Riccardo Rende, Luciano Loris Viteritti
29 May 2024

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Zihao Wang, Shaoduo Gan
07 Apr 2024

Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu
14 Dec 2023

Long-range Language Modeling with Self-retrieval
Ohad Rubin, Jonathan Berant
RALM, KELM
23 Jun 2023

Dissociating language and thought in large language models
Kyle Mahowald, Anna A. Ivanova, I. Blank, Nancy Kanwisher, J. Tenenbaum, Evelina Fedorenko
ELM, ReLM
16 Jan 2023

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
VLM
27 May 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Chao-Yuan Wu, Yanghao Li, K. Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer
ViT
20 Jan 2022

Well Googled is Half Done: Multimodal Forecasting of New Fashion Product Sales with Image-based Google Trends
Geri Skenderi, Christian Joppi, Matteo Denitto, Marco Cristani
AI4TS
20 Sep 2021

Do Long-Range Language Models Actually Use Long-Range Context?
Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer
RALM
19 Sep 2021

Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN
Rahma Chaabouni, Roberto Dessì, Eugene Kharitonov
03 Jul 2021

Sparsifying Transformer Models with Trainable Representation Pooling
Michal Pietruszka, Łukasz Borchmann, Lukasz Garncarek
10 Sep 2020

Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy, M. Saffar, Ashish Vaswani, David Grangier
MoE
12 Mar 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE
17 Sep 2019