MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

2 December 2021
Yanghao Li
Chao-Yuan Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
    ViT

Papers citing "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

50 / 398 papers shown
Vision Transformer with Sparse Scan Prior
Qihang Fan
Huaibo Huang
Mingrui Chen
Ran He
ViT
48
5
0
22 May 2024
Generative Artificial Intelligence: A Systematic Review and Applications
S. S. Sengar
Affan Bin Hasan
Sanjay Kumar
Fiona Carroll
MedIm
36
51
0
17 May 2024
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
Yingjie Zhai
Wenshuo Li
Yehui Tang
Xinghao Chen
Yunhe Wang
ViT
27
0
0
14 May 2024
MambaOut: Do We Really Need Mamba for Vision?
Weihao Yu
Xinchao Wang
Mamba
50
48
0
13 May 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
49
533
0
25 Apr 2024
Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges
Badri N. Patro
Vijay Srinivas Agneeswaran
Mamba
46
38
0
24 Apr 2024
Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation
Abhishek Aich
Yumin Suh
S. Schulter
Manmohan Chandraker
56
0
0
23 Apr 2024
Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing
Yuang Liu
Zhiheng Qiu
Xiaokai Qin
ViT
33
0
0
20 Apr 2024
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training
Jin Gao
Shubo Lin
Shaoru Wang
Yutong Kou
Zeming Li
Liang Li
Congxuan Zhang
Xiaoqin Zhang
Yizheng Wang
Weiming Hu
47
1
0
18 Apr 2024
Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition
Xunsong Li
Pengzhan Sun
Yangcen Liu
Lixin Duan
Wen Li
43
3
0
18 Apr 2024
GeoAI Reproducibility and Replicability: a computational and spatial perspective
Wenwen Li
Chia-Yu Hsu
Sizhe Wang
Peter Kedron
AI4CE
28
6
0
15 Apr 2024
ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition
Otto Brookes
Majid Mirmehdi
H. Kühl
T. Burghardt
35
3
0
13 Apr 2024
X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model
Jan Held
Hani Itani
A. Cioppa
Silvio Giancola
Guohao Li
Marc Van Droogenbroeck
38
16
0
07 Apr 2024
Learning Correlation Structures for Vision Transformers
Manjin Kim
Paul Hongsuck Seo
Cordelia Schmid
Minsu Cho
ViT
40
7
0
05 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Jieneng Chen
Qihang Yu
Xiaohui Shen
Alan L. Yuille
Liang-Chieh Chen
3DV
VLM
36
24
0
02 Apr 2024
LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task
Suyash Vardhan Mathur
Akshett Rai Jindal
Hardik Mittal
Manish Shrivastava
33
1
0
02 Apr 2024
Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
Hyeongjun Kwon
Jinhyun Jang
Jin-Hwa Kim
Kwonyoung Kim
Kwanghoon Sohn
43
1
0
01 Apr 2024
Slightly Shift New Classes to Remember Old Classes for Video Class-Incremental Learning
Jian Jiao
Yu Dai
Hefei Mei
Heqian Qiu
Chuanyang Gong
Shiyuan Tang
Xinpeng Hao
Hongliang Li
CLL
VLM
33
0
0
01 Apr 2024
Benchmarking Object Detectors with COCO: A New Path Forward
Shweta Singh
Aayan Yadav
Jitesh Jain
Humphrey Shi
Justin Johnson
Karan Desai
28
6
0
27 Mar 2024
Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis
Badri N. Patro
Suhas Ranganath
Vinay P. Namboodiri
Vijay Srinivas Agneeswaran
43
2
0
26 Mar 2024
OmniVid: A Generative Framework for Universal Video Understanding
Junke Wang
Dongdong Chen
Chong Luo
Bo He
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
VLM
VGen
71
14
0
26 Mar 2024
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition
Chenhongyi Yang
Zehui Chen
Miguel Espinosa
Linus Ericsson
Zhenyu Wang
Jiaming Liu
Elliot J. Crowley
Mamba
39
88
0
26 Mar 2024
Activity-Biometrics: Person Identification from Daily Activities
Shehreen Azad
Y. S. Rawat
29
3
0
26 Mar 2024
Enhancing Video Transformers for Action Understanding with VLM-aided Training
Hui Lu
Hu Jian
Ronald Poppe
A. A. Salah
42
1
0
24 Mar 2024
PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
Tanvir Mahmud
Burhaneddin Yaman
Chun-Hao Liu
Diana Marculescu
38
2
0
24 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM
AI4TS
58
4
0
21 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
56
7
0
21 Mar 2024
Don't Judge by the Look: Towards Motion Coherent Video Representation
Yitian Zhang
Yue Bai
Huan Wang
Yizhou Wang
Yun Fu
35
0
0
14 Mar 2024
Pig aggression classification using CNN, Transformers and Recurrent Networks
Junior Silva Souza
Eduardo Bedin
G. Higa
Newton Loebens
H. Pistori
32
0
0
13 Mar 2024
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
Jun Xiong
Peng Zhang
Tao You
Chuanyue Li
Wei Huang
Yufei Zha
DiffM
32
5
0
02 Mar 2024
FViT: A Focal Vision Transformer with Gabor Filter
Yulong Shi
Mingwei Sun
Yongshuai Wang
Rui Wang
57
4
0
17 Feb 2024
What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection
Sourabh Vasant Gothe
Vibhav Agarwal
Sourav Ghosh
Jayesh Rajkumar Vachhani
Pranay Kashyap
Barath Raj Kandur
27
2
0
15 Feb 2024
Subgraphormer: Unifying Subgraph GNNs and Graph Transformers via Graph Products
Guy Bar-Shalom
Beatrice Bevilacqua
Haggai Maron
AI4CE
35
6
0
13 Feb 2024
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
Shufan Li
Harkanwar Singh
Aditya Grover
Mamba
95
56
0
08 Feb 2024
Memory Consolidation Enables Long-Context Video Understanding
Ivana Balažević
Yuge Shi
Pinelopi Papalampidi
Rahma Chaabouni
Skanda Koppula
Olivier J. Hénaff
102
22
0
08 Feb 2024
SISP: A Benchmark Dataset for Fine-grained Ship Instance Segmentation in Panchromatic Satellite Images
Pengming Feng
Mingjie Xie
Hongning Liu
Xuanjia Zhao
Guangjun He
Xueliang Zhang
Jian Guan
27
1
0
06 Feb 2024
SAM-based instance segmentation models for the automation of structural damage detection
Zehao Ye
Lucy Lovell
A. Faramarzi
Jelena Ninić
16
13
0
27 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
22
7
0
25 Jan 2024
PanAf20K: A Large Video Dataset for Wild Ape Detection and Behaviour Recognition
Otto Brookes
Majid Mirmehdi
Colleen Stephens
Samuel Angedakin
Katherine Corogenes
...
Klaus Zuberbühler
Christophe Boesch
M. Arandjelovic
H. Kühl
T. Burghardt
35
13
0
24 Jan 2024
WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing
Shuokang Huang
Kaihan Li
Di You
Yichong Chen
Arvin Lin
Siying Liu
Xiaohui Li
Julie A. McCann
30
6
0
24 Jan 2024
UniHDA: A Unified and Versatile Framework for Multi-Modal Hybrid Domain Adaptation
Hengjia Li
Yang Liu
Yuqi Lin
Zhanwei Zhang
Yibo Zhao
...
Tu Zheng
Zheng Yang
Yuchun Jiang
Boxi Wu
Deng Cai
DiffM
36
0
0
23 Jan 2024
M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition
Mengmeng Wang
Jiazheng Xing
Boyuan Jiang
Jun Chen
Jianbiao Mei
Xingxing Zuo
Guang Dai
Jingdong Wang
Yong-Jin Liu
VLM
28
4
0
22 Jan 2024
Segment Anything Model Can Not Segment Anything: Assessing AI Foundation Model's Generalizability in Permafrost Mapping
Wenwen Li
Chia-Yu Hsu
Sizhe Wang
Yezhou Yang
Hyunho Lee
...
Brendan M. Rogers
S. Arundel
Matthew B. Jones
Kenton McHenry
Patricia Solis
VLM
39
13
0
16 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
41
5
0
08 Jan 2024
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
37
4
0
08 Jan 2024
Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket
Zhaokun Zhou
Kaiwei Che
Wei Fang
Keyu Tian
Yuesheng Zhu
Shuicheng Yan
Yonghong Tian
Li Yuan
ViT
41
28
0
04 Jan 2024
Detours for Navigating Instructional Videos
Kumar Ashutosh
Zihui Xue
Tushar Nagarajan
Kristen Grauman
29
6
0
03 Jan 2024
SVFAP: Self-supervised Video Facial Affect Perceiver
Guoying Zhao
Zheng Lian
Kexin Wang
Yu He
Ming Xu
Haiyang Sun
Bin Liu
Jianhua Tao
56
14
0
31 Dec 2023
Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization
Ioanna Ntinou
Enrique Sanchez
Georgios Tzimiropoulos
49
4
0
29 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
176
924
0
21 Dec 2023