ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.11227
  4. Cited By
Multiscale Vision Transformers

Multiscale Vision Transformers

22 April 2021
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
    ViT
ArXivPDFHTML

Papers citing "Multiscale Vision Transformers"

50 / 736 papers shown
Title
COMPOSER: Compositional Reasoning of Group Activity in Videos with
  Keypoint-Only Modality
COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
Honglu Zhou
Asim Kadav
Aviv Shamsian
Shijie Geng
Farley Lai
Long Zhao
Tingxi Liu
Mubbasir Kapadia
H. Graf
19
23
0
11 Dec 2021
Spatio-temporal Relation Modeling for Few-shot Action Recognition
Spatio-temporal Relation Modeling for Few-shot Action Recognition
Anirudh Thatipelli
Sanath Narayan
Salman Khan
Rao Muhammad Anwer
F. Khan
Bernard Ghanem
ViT
25
88
0
09 Dec 2021
DualFormer: Local-Global Stratified Transformer for Efficient Video
  Recognition
DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Yuxuan Liang
Pan Zhou
Roger Zimmermann
Shuicheng Yan
ViT
23
21
0
09 Dec 2021
BEVT: BERT Pretraining of Video Transformers
BEVT: BERT Pretraining of Video Transformers
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
ViT
36
203
0
02 Dec 2021
MViTv2: Improved Multiscale Vision Transformers for Classification and
  Detection
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
48
677
0
02 Dec 2021
Self-supervised Video Transformer
Self-supervised Video Transformer
Kanchana Ranasinghe
Muzammal Naseer
Salman Khan
F. Khan
Michael S. Ryoo
ViT
33
84
0
02 Dec 2021
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Lingchen Meng
Hengduo Li
Bor-Chun Chen
Shiyi Lan
Zuxuan Wu
Yu-Gang Jiang
Ser-Nam Lim
ViT
25
219
0
30 Nov 2021
Adaptive Token Sampling For Efficient Vision Transformers
Adaptive Token Sampling For Efficient Vision Transformers
Mohsen Fayyaz
Soroush Abbasi Koohpayegani
F. Jafari
Sunando Sengupta
Hamid Reza Vaezi Joze
Eric Sommerlade
Hamed Pirsiavash
Juergen Gall
ViT
16
146
0
30 Nov 2021
Video Frame Interpolation Transformer
Video Frame Interpolation Transformer
Zhihao Shi
Xiangyu Xu
Xiaohong Liu
Jun Chen
Ming-Hsuan Yang
ViT
17
157
0
27 Nov 2021
SWAT: Spatial Structure Within and Among Tokens
SWAT: Spatial Structure Within and Among Tokens
Kumara Kahatapitiya
Michael S. Ryoo
25
6
0
26 Nov 2021
Weakly-guided Self-supervised Pretraining for Temporal Activity
  Detection
Weakly-guided Self-supervised Pretraining for Temporal Activity Detection
Kumara Kahatapitiya
Zhou Ren
Haoxiang Li
Zhenyu Wu
Michael S. Ryoo
G. Hua
ViT
28
6
0
26 Nov 2021
BoxeR: Box-Attention for 2D and 3D Transformers
BoxeR: Box-Attention for 2D and 3D Transformers
Duy-Kien Nguyen
Jihong Ju
Olaf Booji
Martin R. Oswald
Cees G. M. Snoek
ViT
28
36
0
25 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
35
73
0
25 Nov 2021
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal
  Representation Learning
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
David Junhao Zhang
Kunchang Li
Yali Wang
Yuxiang Chen
Shashwat Chandra
Yu Qiao
Luoqi Liu
Mike Zheng Shou
AI4TS
27
30
0
24 Nov 2021
PhysFormer: Facial Video-based Physiological Measurement with Temporal
  Difference Transformer
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer
Zitong Yu
Yuming Shen
Jingang Shi
Hengshuang Zhao
Philip H. S. Torr
Guoying Zhao
ViT
MedIm
137
167
0
23 Nov 2021
Efficient Video Transformers with Spatial-Temporal Token Selection
Efficient Video Transformers with Spatial-Temporal Token Selection
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
21
63
0
23 Nov 2021
Benchmarking Detection Transfer Learning with Vision Transformers
Benchmarking Detection Transfer Learning with Vision Transformers
Yanghao Li
Saining Xie
Xinlei Chen
Piotr Dollar
Kaiming He
Ross B. Girshick
17
165
0
22 Nov 2021
PyTorchVideo: A Deep Learning Library for Video Understanding
PyTorchVideo: A Deep Learning Library for Video Understanding
Haoqi Fan
Tullie Murrell
Heng Wang
Kalyan Vasudev Alwala
Yanghao Li
...
Ross B. Girshick
Matt Feiszli
Aaron B. Adcock
Wan-Yen Lo
Christoph Feichtenhofer
VLM
ViT
26
49
0
18 Nov 2021
Evaluating Transformers for Lightweight Action Recognition
Evaluating Transformers for Lightweight Action Recognition
Raivo Koot
Markus Hennerbichler
Haiping Lu
ViT
28
8
0
18 Nov 2021
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
Jiaqi Gu
Hyoukjun Kwon
Dilin Wang
Wei Ye
Meng Li
Yu-Hsin Chen
Liangzhen Lai
Vikas Chandra
D. Pan
ViT
24
182
0
01 Nov 2021
The Efficiency Misnomer
The Efficiency Misnomer
Daoyuan Chen
Liuyi Yao
Dawei Gao
Ashish Vaswani
Yaliang Li
34
98
0
25 Oct 2021
HRFormer: High-Resolution Transformer for Dense Prediction
HRFormer: High-Resolution Transformer for Dense Prediction
Yuhui Yuan
Rao Fu
Lang Huang
Weihong Lin
Chao Zhang
Xilin Chen
Jingdong Wang
ViT
38
227
0
18 Oct 2021
"Knights": First Place Submission for VIPriors21 Action Recognition
  Challenge at ICCV 2021
"Knights": First Place Submission for VIPriors21 Action Recognition Challenge at ICCV 2021
Ishan R. Dave
Naman Biyani
Brandon Clark
Rohit Gupta
Y. S. Rawat
M. Shah
ViT
22
3
0
14 Oct 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
229
1,019
0
13 Oct 2021
Object-Region Video Transformers
Object-Region Video Transformers
Roei Herzig
Elad Ben-Avraham
K. Mangalam
Amir Bar
Gal Chechik
Anna Rohrbach
Trevor Darrell
Amir Globerson
ViT
21
82
0
13 Oct 2021
TAda! Temporally-Adaptive Convolutions for Video Understanding
TAda! Temporally-Adaptive Convolutions for Video Understanding
Ziyuan Huang
Shiwei Zhang
Liang Pan
Zhiwu Qing
Mingqian Tang
Ziwei Liu
M. Ang
40
49
0
12 Oct 2021
Revitalizing CNN Attentions via Transformers in Self-Supervised Visual
  Representation Learning
Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning
Chongjian Ge
Youwei Liang
Yibing Song
Jianbo Jiao
Jue Wang
Ping Luo
ViT
21
36
0
11 Oct 2021
ActionCLIP: A New Paradigm for Video Action Recognition
ActionCLIP: A New Paradigm for Video Action Recognition
Mengmeng Wang
Jiazheng Xing
Yong Liu
VLM
152
362
0
17 Sep 2021
Overview of Tencent Multi-modal Ads Video Understanding Challenge
Overview of Tencent Multi-modal Ads Video Understanding Challenge
Zhenzhi Wang
Liyu Wu
Zhimin Li
Jiangfeng Xiong
Qinglin Lu
19
4
0
16 Sep 2021
Scaled ReLU Matters for Training Vision Transformers
Scaled ReLU Matters for Training Vision Transformers
Pichao Wang
Xue Wang
Haowen Luo
Jingkai Zhou
Zhipeng Zhou
Fan Wang
Hao Li
R. L. Jin
19
41
0
08 Sep 2021
Class Semantics-based Attention for Action Detection
Class Semantics-based Attention for Action Detection
Deepak Sridhar
N. Quader
S. Muralidharan
Yaoxin Li
Peng Dai
Juwei Lu
18
66
0
06 Sep 2021
Revisiting 3D ResNets for Video Recognition
Revisiting 3D ResNets for Video Recognition
Xianzhi Du
Yeqing Li
Yin Cui
Rui Qian
Jing Li
Irwan Bello
51
17
0
03 Sep 2021
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Xuefan Zha
Wentao Zhu
Tingxun Lv
Sen Yang
Ji Liu
AI4TS
ViT
33
27
0
26 Aug 2021
StarVQA: Space-Time Attention for Video Quality Assessment
StarVQA: Space-Time Attention for Video Quality Assessment
Fengchuang Xing
Yuan-Gen Wang
Hanpin Wang
Leida Li
Guopu Zhu
ViT
14
22
0
22 Aug 2021
FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for
  Blind Face Inpainting
FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for Blind Face Inpainting
Junke Wang
Shaoxiang Chen
Zuxuan Wu
Yu-Gang Jiang
CVBM
19
24
0
10 Aug 2021
S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
S2^22-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
Tan Yu
Xu Li
Yunfeng Cai
Mingming Sun
Ping Li
39
50
0
02 Aug 2021
EAN: Event Adaptive Network for Enhanced Action Recognition
EAN: Event Adaptive Network for Enhanced Action Recognition
Yuan Tian
Yichao Yan
Guangtao Zhai
G. Guo
Zhiyong Gao
32
41
0
22 Jul 2021
CycleMLP: A MLP-like Architecture for Dense Prediction
CycleMLP: A MLP-like Architecture for Dense Prediction
Shoufa Chen
Enze Xie
Chongjian Ge
Runjian Chen
Ding Liang
Ping Luo
19
231
0
21 Jul 2021
Visual Parser: Representing Part-whole Hierarchies with Transformers
Visual Parser: Representing Part-whole Hierarchies with Transformers
Shuyang Sun
Xiaoyu Yue
S. Bai
Philip H. S. Torr
50
27
0
13 Jul 2021
What Makes for Hierarchical Vision Transformer?
What Makes for Hierarchical Vision Transformer?
Yuxin Fang
Xinggang Wang
Rui Wu
Wenyu Liu
ViT
18
9
0
05 Jul 2021
CSWin Transformer: A General Vision Transformer Backbone with
  Cross-Shaped Windows
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
Xiaoyi Dong
Jianmin Bao
Dongdong Chen
Weiming Zhang
Nenghai Yu
Lu Yuan
Dong Chen
B. Guo
ViT
25
953
0
01 Jul 2021
Rethinking Token-Mixing MLP for MLP-based Vision Backbone
Rethinking Token-Mixing MLP for MLP-based Vision Backbone
Tan Yu
Xu Li
Yunfeng Cai
Mingming Sun
Ping Li
45
26
0
28 Jun 2021
Early Convolutions Help Transformers See Better
Early Convolutions Help Transformers See Better
Tete Xiao
Mannat Singh
Eric Mintun
Trevor Darrell
Piotr Dollár
Ross B. Girshick
20
752
0
28 Jun 2021
Can An Image Classifier Suffice For Action Recognition?
Can An Image Classifier Suffice For Action Recognition?
Quanfu Fan
Chun-Fu Chen
Chen
Rameswar Panda
ViT
29
33
0
26 Jun 2021
Video Swin Transformer
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng-Wei Zhang
Stephen Lin
Han Hu
ViT
44
1,444
0
24 Jun 2021
VOLO: Vision Outlooker for Visual Recognition
VOLO: Vision Outlooker for Visual Recognition
Li-xin Yuan
Qibin Hou
Zihang Jiang
Jiashi Feng
Shuicheng Yan
ViT
52
313
0
24 Jun 2021
IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision
  Transformers
IA-RED2^22: Interpretability-Aware Redundancy Reduction for Vision Transformers
Bowen Pan
Rameswar Panda
Yifan Jiang
Zhangyang Wang
Rogerio Feris
A. Oliva
VLM
ViT
39
153
0
23 Jun 2021
P2T: Pyramid Pooling Transformer for Scene Understanding
P2T: Pyramid Pooling Transformer for Scene Understanding
Yu-Huan Wu
Yun-Hai Liu
Xin Zhan
Mingg-Ming Cheng
ViT
29
219
0
22 Jun 2021
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Michael S. Ryoo
A. Piergiovanni
Anurag Arnab
Mostafa Dehghani
A. Angelova
ViT
32
127
0
21 Jun 2021
XCiT: Cross-Covariance Image Transformers
XCiT: Cross-Covariance Image Transformers
Alaaeldin El-Nouby
Hugo Touvron
Mathilde Caron
Piotr Bojanowski
Matthijs Douze
...
Ivan Laptev
Natalia Neverova
Gabriel Synnaeve
Jakob Verbeek
Hervé Jégou
ViT
36
497
0
17 Jun 2021
Previous
123...131415
Next