ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2112.01526
  4. Cited By
MViTv2: Improved Multiscale Vision Transformers for Classification and
  Detection

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

2 December 2021
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
    ViT
ArXivPDFHTML

Papers citing "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

50 / 398 papers shown
Title
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM
AI4TS
173
69
0
30 Dec 2022
Transformers in Action Recognition: A Review on Temporal Modeling
Transformers in Action Recognition: A Review on Temporal Modeling
Elham Shabaninia
Hossein Nezamabadi-pour
Fatemeh Shafizadegan
ViT
24
8
0
29 Dec 2022
Detecting Objects with Context-Likelihood Graphs and Graph Refinement
Detecting Objects with Context-Likelihood Graphs and Graph Refinement
Aritra Bhowmik
Yu Wang
N. Baka
Martin R. Oswald
Cees G. M. Snoek
24
2
0
23 Dec 2022
MAViL: Masked Audio-Video Learners
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
26
51
0
15 Dec 2022
Rethinking Vision Transformers for MobileNet Size and Speed
Rethinking Vision Transformers for MobileNet Size and Speed
Yanyu Li
Ju Hu
Yang Wen
Georgios Evangelidis
Kamyar Salahi
Yanzhi Wang
Sergey Tulyakov
Jian Ren
ViT
35
159
0
15 Dec 2022
GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group
  Propagation
GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation
Chenhongyi Yang
Jiarui Xu
Shalini De Mello
Elliot J. Crowley
Xinyu Wang
ViT
38
21
0
13 Dec 2022
Masked Video Distillation: Rethinking Masked Feature Modeling for
  Self-supervised Video Representation Learning
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Lu Yuan
Yu-Gang Jiang
VGen
32
87
0
08 Dec 2022
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers
  using Synthetic Scene Data
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Roei Herzig
Ofir Abramovich
Elad Ben-Avraham
Assaf Arbelle
Leonid Karlinsky
Ariel Shamir
Trevor Darrell
Amir Globerson
41
16
0
08 Dec 2022
ViTPose++: Vision Transformer for Generic Body Pose Estimation
ViTPose++: Vision Transformer for Generic Body Pose Estimation
Yufei Xu
Jing Zhang
Qiming Zhang
Dacheng Tao
ViT
42
40
0
07 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video
  Learning
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
36
54
0
06 Dec 2022
VLG: General Video Recognition with Web Textual Knowledge
VLG: General Video Recognition with Web Textual Knowledge
Jintao Lin
Zhaoyang Liu
Wenhai Wang
Wayne Wu
Limin Wang
39
0
0
03 Dec 2022
Test-Time Mixup Augmentation for Data and Class-Specific Uncertainty
  Estimation in Deep Learning Image Classification
Test-Time Mixup Augmentation for Data and Class-Specific Uncertainty Estimation in Deep Learning Image Classification
Han S. Lee
Haeil Lee
H. Hong
Junmo Kim
UQCV
25
0
0
01 Dec 2022
Lightweight Structure-Aware Attention for Visual Understanding
Lightweight Structure-Aware Attention for Visual Understanding
Heeseung Kwon
F. M. Castro
M. Marín-Jiménez
N. Guil
Alahari Karteek
28
2
0
29 Nov 2022
Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal
  Action Localization
Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
Chen Zhao
Shuming Liu
K. Mangalam
Guohao Li
38
17
0
25 Nov 2022
Query Efficient Cross-Dataset Transferable Black-Box Attack on Action
  Recognition
Query Efficient Cross-Dataset Transferable Black-Box Attack on Action Recognition
Rohit Gupta
Naveed Akhtar
Gaurav Kumar Nayak
Ajmal Mian
M. Shah
AAML
34
1
0
23 Nov 2022
Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token
  Migration
Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration
Yunjie Tian
Lingxi Xie
Jihao Qiu
Jianbin Jiao
Yaowei Wang
Qi Tian
Qixiang Ye
ViT
39
6
0
23 Nov 2022
U-Flow: A U-shaped Normalizing Flow for Anomaly Detection with
  Unsupervised Threshold
U-Flow: A U-shaped Normalizing Flow for Anomaly Detection with Unsupervised Threshold
Matías Tailanián
Álvaro Pardo
Pablo Musé
30
18
0
22 Nov 2022
Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
Qibin Hou
Cheng Lu
Mingg-Ming Cheng
Jiashi Feng
ViT
34
129
0
22 Nov 2022
Vision Transformer with Super Token Sampling
Vision Transformer with Super Token Sampling
Huaibo Huang
Xiaoqiang Zhou
Jie Cao
Ran He
Tieniu Tan
ViT
23
56
0
21 Nov 2022
Unifying Tracking and Image-Video Object Detection
Unifying Tracking and Image-Video Object Detection
Peirong Liu
Rui Wang
Pengchuan Zhang
Omid Poursaeed
Yipin Zhou
Xuefei Cao
Sreya . Dutta Roy
Ashish Shah
Ser-Nam Lim
21
0
0
20 Nov 2022
Castling-ViT: Compressing Self-Attention via Switching Towards
  Linear-Angular Attention at Vision Transformer Inference
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
Haoran You
Yunyang Xiong
Xiaoliang Dai
Bichen Wu
Peizhao Zhang
Haoqi Fan
Peter Vajda
Yingyan Lin
35
32
0
18 Nov 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video
  UniFormer
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
Limin Wang
Yu Qiao
ViT
30
107
0
17 Nov 2022
EVA: Exploring the Limits of Masked Visual Representation Learning at
  Scale
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
CLIP
87
675
0
14 Nov 2022
Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
Haokui Zhang
Wenze Hu
Xiaoyu Wang
ViT
19
8
0
14 Nov 2022
MARLIN: Masked Autoencoder for facial video Representation LearnINg
MARLIN: Masked Autoencoder for facial video Representation LearnINg
Zhixi Cai
Shreya Ghosh
Kalin Stefanov
Abhinav Dhall
Jianfei Cai
Hamid Rezatofighi
Reza Haffari
Munawar Hayat
ViT
CVBM
27
60
0
12 Nov 2022
PatchBlender: A Motion Prior for Video Transformers
PatchBlender: A Motion Prior for Video Transformers
Gabriele Prato
Yale Song
Janarthanan Rajendran
R. Devon Hjelm
Neel Joshi
Sarath Chandar
ViT
27
0
0
11 Nov 2022
MAST: Multiscale Audio Spectrogram Transformers
MAST: Multiscale Audio Spectrogram Transformers
Sreyan Ghosh
Ashish Seth
S. Umesh
Tianyi Zhou
22
3
0
02 Nov 2022
Grafting Vision Transformers
Grafting Vision Transformers
Jong Sung Park
Kumara Kahatapitiya
Donghyun Kim
Shivchander Sudalairaj
Quanfu Fan
Michael S. Ryoo
ViT
29
2
0
28 Oct 2022
MetaFormer Baselines for Vision
MetaFormer Baselines for Vision
Weihao Yu
Chenyang Si
Pan Zhou
Mi Luo
Yichen Zhou
Jiashi Feng
Shuicheng Yan
Xinchao Wang
MoE
40
156
0
24 Oct 2022
S2WAT: Image Style Transfer via Hierarchical Vision Transformer using
  Strips Window Attention
S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention
Chi Zhang
Lu Zhou
Lei Wang
Zaiyan Dai
Jun Yang
ViT
34
23
0
22 Oct 2022
Play It Back: Iterative Attention for Audio Recognition
Play It Back: Iterative Attention for Audio Recognition
Alexandros Stergiou
Dima Damen
37
4
0
20 Oct 2022
Token Merging: Your ViT But Faster
Token Merging: Your ViT But Faster
Daniel Bolya
Cheng-Yang Fu
Xiaoliang Dai
Peizhao Zhang
Christoph Feichtenhofer
Judy Hoffman
MoMe
51
417
0
17 Oct 2022
Self-supervised Video Representation Learning with Motion-Aware Masked
  Autoencoders
Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders
Haosen Yang
Deng Huang
Bin Wen
Jiannan Wu
H. Yao
Yi-Xin Jiang
Xiatian Zhu
Zehuan Yuan
37
19
0
09 Oct 2022
Centralized Feature Pyramid for Object Detection
Centralized Feature Pyramid for Object Detection
Yu Quan
Dong Zhang
Liyan Zhang
Jinhui Tang
ObjD
31
148
0
05 Oct 2022
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision
  Models
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models
Chenglin Yang
Siyuan Qiao
Qihang Yu
Xiaoding Yuan
Yukun Zhu
Alan Yuille
Hartwig Adam
Liang-Chieh Chen
ViT
MoE
39
58
0
04 Oct 2022
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and
  Effective Fusion of Local, Global and Input Features
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
S. Wadekar
Abhishek Chaurasia
ViT
103
87
0
30 Sep 2022
3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical
  Transformer for Medical Image Segmentation
3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation
Ho Hin Lee
Shunxing Bao
Yuankai Huo
Bennett A. Landman
OOD
MedIm
56
133
0
29 Sep 2022
Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain
  Supervision for Domain-adaptive Action Detection
Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-adaptive Action Detection
Yifan Lu
Gurkirt Singh
Suman Saha
Luc Van Gool
TTA
37
2
0
28 Sep 2022
RALACs: Action Recognition in Autonomous Vehicles using Interaction
  Encoding and Optical Flow
RALACs: Action Recognition in Autonomous Vehicles using Interaction Encoding and Optical Flow
Eddy Zhou
Alex Zhuang
Alikasim Budhwani
Owen Leather
Rowan Dempster
Quanquan Li
Mohammad K. Al-Sharman
Derek Rayside
W. Melek
29
0
0
28 Sep 2022
Multi-dataset Training of Transformers for Robust Action Recognition
Multi-dataset Training of Transformers for Robust Action Recognition
Junwei Liang
Enwei Zhang
Jun Zhang
Chunhua Shen
ViT
42
11
0
26 Sep 2022
Hydra Attention: Efficient Attention with Many Heads
Hydra Attention: Efficient Attention with Many Heads
Daniel Bolya
Cheng-Yang Fu
Xiaoliang Dai
Peizhao Zhang
Judy Hoffman
99
77
0
15 Sep 2022
On the Surprising Effectiveness of Transformers in Low-Labeled Video
  Recognition
On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition
Farrukh Rahman
Ömer Mubarek
Z. Kira
ViT
18
2
0
15 Sep 2022
Spatio-Temporal Action Detection Under Large Motion
Spatio-Temporal Action Detection Under Large Motion
Gurkirt Singh
Vasileios Choutas
Suman Saha
Feng Yu
Luc Van Gool
18
12
0
06 Sep 2022
Transformers in Remote Sensing: A Survey
Transformers in Remote Sensing: A Survey
Abdulaziz Amer Aleissaee
Amandeep Kumar
Rao Muhammad Anwer
Salman Khan
Hisham Cholakkal
Guisong Xia
Fahad Shahbaz Khan
ViT
57
175
0
02 Sep 2022
MAFormer: A Transformer Network with Multi-scale Attention Fusion for
  Visual Recognition
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition
Y. Wang
H. Sun
Xiaodi Wang
Bin Zhang
Chaonan Li
Ying Xin
Baochang Zhang
Errui Ding
Shumin Han
ViT
31
9
0
31 Aug 2022
Video Mobile-Former: Video Recognition with Efficient Global
  Spatial-temporal Modeling
Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling
Rui Wang
Zuxuan Wu
Dongdong Chen
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Luowei Zhou
Lu Yuan
Yu-Gang Jiang
ViT
40
4
0
25 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
51
629
0
22 Aug 2022
Improved Image Classification with Token Fusion
Improved Image Classification with Token Fusion
Keong-Hun Choi
Jin-Woo Kim
Yaolong Wang
J. Ha
ViT
19
0
0
19 Aug 2022
In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze
  Estimation
In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation
Bolin Lai
Miao Liu
Fiona Ryan
James M. Rehg
ViT
40
33
0
08 Aug 2022
Frozen CLIP Models are Efficient Video Learners
Frozen CLIP Models are Efficient Video Learners
Ziyi Lin
Shijie Geng
Renrui Zhang
Peng Gao
Gerard de Melo
Xiaogang Wang
Jifeng Dai
Yu Qiao
Hongsheng Li
CLIP
VLM
16
200
0
06 Aug 2022
Previous
12345678
Next