Do Transformers Need Deep Long-Range Memory?
arXiv:2007.03356 · 7 July 2020
Jack W. Rae, Ali Razavi · RALM

Papers citing "Do Transformers Need Deep Long-Range Memory?" (13 papers shown)
Are queries and keys always relevant? A case study on Transformer wave functions
Riccardo Rende, Luciano Loris Viteritti · 29 May 2024
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Zihao Wang, Shaoduo Gan · 07 Apr 2024
Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu · 14 Dec 2023
Long-range Language Modeling with Self-retrieval
Ohad Rubin, Jonathan Berant · RALM, KELM · 23 Jun 2023
Dissociating language and thought in large language models
Kyle Mahowald, Anna A. Ivanova, I. Blank, Nancy Kanwisher, J. Tenenbaum, Evelina Fedorenko · ELM, ReLM · 16 Jan 2023
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré · VLM · 27 May 2022
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Chao-Yuan Wu, Yanghao Li, K. Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer · ViT · 20 Jan 2022
Well Googled is Half Done: Multimodal Forecasting of New Fashion Product Sales with Image-based Google Trends
Geri Skenderi, Christian Joppi, Matteo Denitto, Marco Cristani · AI4TS · 20 Sep 2021
Do Long-Range Language Models Actually Use Long-Range Context?
Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer · RALM · 19 Sep 2021
Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN
Rahma Chaabouni, Roberto Dessì, Eugene Kharitonov · 03 Jul 2021
Sparsifying Transformer Models with Trainable Representation Pooling
Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek · 10 Sep 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy, M. Saffar, Ashish Vaswani, David Grangier · MoE · 12 Mar 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro · MoE · 17 Sep 2019