ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.
Home › Papers › 1906.05909 › Cited By
Stand-Alone Self-Attention in Vision Models
13 June 2019
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
Tags: VLM, SLR, ViT
ArXiv (abs) · PDF · HTML

Papers citing "Stand-Alone Self-Attention in Vision Models"

Showing 50 of 588 papers. Each entry: title | authors | topic tags (where assigned) | site metrics | date.
Learning to Estimate Hidden Motions with Global Motion Aggregation
Shihao Jiang, Dylan Campbell, Yao Lu, Hongdong Li, Leonid Sigal | 3DPC | 145 · 297 · 0 | 06 Apr 2021

An Empirical Study of Training Self-Supervised Vision Transformers
Xinlei Chen, Saining Xie, Kaiming He | ViT | 183 · 1,874 · 0 | 05 Apr 2021

Group-Free 3D Object Detection via Transformers
Ze Liu, Zheng Zhang, Yue Cao, Han Hu, Xin Tong | ViT, 3DPC | 101 · 315 · 0 | 01 Apr 2021

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman | VGen | 215 · 1,193 · 0 | 01 Apr 2021

Motion Guided Attention Fusion to Recognize Interactions from Videos
Tae Soo Kim, Jonathan D. Jones, Gregory Hager | 39 · 15 · 0 | 01 Apr 2021

Going deeper with Image Transformers
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou | ViT | 193 · 1,025 · 0 | 31 Mar 2021

Rethinking Spatial Dimensions of Vision Transformers
Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh | ViT | 562 · 585 · 0 | 30 Mar 2021

ViViT: A Video Vision Transformer
Anurag Arnab, Mostafa Dehghani, G. Heigold, Chen Sun, Mario Lucic, Cordelia Schmid | ViT | 240 · 2,175 · 0 | 29 Mar 2021

BA^2M: A Batch Aware Attention Module for Image Classification
Qishang Cheng, Hongliang Li, Qi Wu, K. Ngan | 39 · 6 · 0 | 28 Mar 2021

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Chun-Fu Chen, Quanfu Fan, Yikang Shen | ViT | 73 · 1,500 · 0 | 27 Mar 2021

TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization
Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han, Bolei Zhou, QiXiang Ye | ViT, WSOL | 91 · 203 · 0 | 27 Mar 2021

Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers
Tianyu Zhu, Markus Hiller, Mahsa Ehsanpour, Rongkai Ma, Tom Drummond, Ian Reid, Hamid Rezatofighi | VOT | 87 · 36 · 0 | 27 Mar 2021

Understanding Robustness of Transformers for Image Classification
Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit | ViT | 116 · 392 · 0 | 26 Mar 2021

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, B. Guo | ViT | 517 · 21,773 · 0 | 25 Mar 2021

An Image is Worth 16x16 Words, What is a Video Worth?
Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor | ViT | 100 · 125 · 0 | 25 Mar 2021

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, V. Koltun | ViT, MDE | 158 · 1,752 · 0 | 24 Mar 2021

SaccadeCam: Adaptive Visual Attention for Monocular Depth Sensing
Brevin Tilmon, S. Koppal | MDE | 54 · 5 · 0 | 24 Mar 2021

Scaling Local Self-Attention for Parameter Efficient Visual Backbones
Ashish Vaswani, Prajit Ramachandran, A. Srinivas, Niki Parmar, Blake A. Hechtman, Jonathon Shlens | 109 · 404 · 0 | 23 Mar 2021

Instance-level Image Retrieval using Reranking Transformers
Fuwen Tan, Jiangbo Yuan, Vicente Ordonez | ViT | 165 · 93 · 0 | 22 Mar 2021

DeepViT: Towards Deeper Vision Transformer
Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng | ViT | 130 · 525 · 0 | 22 Mar 2021

Incorporating Convolution Designs into Visual Transformers
Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, F. Yu, Wei Wu | ViT | 115 · 484 · 0 | 22 Mar 2021

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Stéphane d'Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun | ViT | 143 · 835 · 0 | 19 Mar 2021

Scalable Vision Transformers with Hierarchical Pooling
Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai | ViT | 95 · 130 · 0 | 19 Mar 2021

UNETR: Transformers for 3D Medical Image Segmentation
Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, H. Roth, Daguang Xu | ViT, MedIm | 198 · 1,634 · 0 | 18 Mar 2021

Revisiting ResNets: Improved Training and Scaling Strategies
Irwan Bello, W. Fedus, Xianzhi Du, E. D. Cubuk, A. Srinivas, Nayeon Lee, Jonathon Shlens, Barret Zoph | 98 · 302 · 0 | 13 Mar 2021

Involution: Inverting the Inherence of Convolution for Visual Recognition
Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen | BDL | 77 · 304 · 0 | 10 Mar 2021

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas | 163 · 388 · 0 | 05 Mar 2021

Perceiver: General Perception with Iterative Attention
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira | VLM, ViT, MDE | 214 · 1,029 · 0 | 04 Mar 2021

Generative Adversarial Transformers
Drew A. Hudson, C. L. Zitnick | ViT | 131 · 182 · 0 | 01 Mar 2021

DR-TANet: Dynamic Receptive Temporal Attention Network for Street Scene Change Detection
Shuo Chen, Kailun Yang, Rainer Stiefelhagen | 66 · 39 · 0 | 01 Mar 2021

Representation Learning for Event-based Visuomotor Policies
Sai H. Vemprala, Sami Mian, Ashish Kapoor | 55 · 23 · 0 | 01 Mar 2021

Nested-block self-attention for robust radiotherapy planning segmentation
Harini Veeraraghavan, Jue Jiang, Elguindi Sharif, S. Berry, Ifeanyirochukwu Onochie, Aditya P. Apte, L. Cerviño, Joseph O. Deasy | 93 · 3 · 0 | 26 Feb 2021

Iterative SE(3)-Transformers
F. Fuchs, E. Wagstaff, Justas Dauparas, Ingmar Posner | 64 · 17 · 0 | 26 Feb 2021

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao | ViT | 583 · 3,754 · 0 | 24 Feb 2021

Conditional Positional Encodings for Vision Transformers
Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Chunhua Shen | ViT | 150 · 625 · 0 | 22 Feb 2021

UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu, Amanpreet Singh | ViT | 106 · 301 · 0 | 22 Feb 2021

Evolving Attention with Residual Convolutions
Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jiahao Yu, Ce Zhang, Gao Huang, Yunhai Tong | ViT | 112 · 34 · 0 | 20 Feb 2021

Hard-Attention for Scalable Image Classification
Athanasios Papadopoulos, Pawel Korus, N. Memon | 114 · 25 · 0 | 20 Feb 2021

LambdaNetworks: Modeling Long-Range Interactions Without Attention
Irwan Bello | 359 · 181 · 0 | 17 Feb 2021

Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition
Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho | TTA | 79 · 41 · 0 | 14 Feb 2021

Training Vision Transformers for Image Retrieval
Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou | ViT | 139 · 159 · 0 | 10 Feb 2021

Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani | ViT | 420 · 2,075 · 0 | 09 Feb 2021

CKConv: Continuous Kernel Convolution For Sequential Data
David W. Romero, Anna Kuzina, Erik J. Bekkers, Jakub M. Tomczak, Mark Hoogendoorn | 75 · 126 · 0 | 04 Feb 2021

Relaxed Transformer Decoders for Direct Action Proposal Generation
Jing Tan, Jiaqi Tang, Limin Wang, Gangshan Wu | ViT | 156 · 182 · 0 | 03 Feb 2021

Self-Attention Meta-Learner for Continual Learning
Ghada Sokar, Decebal Constantin Mocanu, Mykola Pechenizkiy | CLL | 53 · 11 · 0 | 28 Jan 2021

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li-xin Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan | ViT | 192 · 1,953 · 0 | 28 Jan 2021

CNN with large memory layers
R. Karimov, Yury Malkov, Karim Iskakov, Victor Lempitsky | 44 · 0 · 0 | 27 Jan 2021

Bottleneck Transformers for Visual Recognition
A. Srinivas, Nayeon Lee, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani | SLR | 373 · 997 · 0 | 27 Jan 2021

Channelized Axial Attention for Semantic Segmentation -- Considering Channel Relation within Spatial Attention for Semantic Segmentation
Ye Huang, Di Kang, W. Jia, Xiangjian He, Liu Liu | 169 · 36 · 0 | 19 Jan 2021

Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification
Ardhendu Behera, Zachary Wharton, Pradeep Ruwan Padmasiri Galbokka Hewage, Asish Bera | 118 · 110 · 0 | 17 Jan 2021