MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
2 December 2021
Yanghao Li
Chao-Yuan Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
Papers citing "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"
50 / 398 papers shown
Logos as a Well-Tempered Pre-train for Sign Language Recognition
Ilya Ovodov
Petr Surovtsev
Karina Kvanchiani
A. Kapitanov
Alexander Nagaev
14
0
0
15 May 2025
TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series
Xiaolei Qin
Di Wang
Jing Zhang
Fengxiang Wang
Xin Su
Bo Du
Liangpei Zhang
AI4TS
24
0
0
13 May 2025
Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
Wei Li
Ming Hu
Guoan Wang
Lihao Liu
Kaijin Zhou
Junzhi Ning
Xin Guo
Zongyuan Ge
Lixu Gu
Junjun He
28
0
0
12 May 2025
Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models
Mishal Fatima
Steffen Jung
M. Keuper
40
0
0
06 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
84
0
0
28 Apr 2025
Hierarchical and Multimodal Data for Daily Activity Understanding
Ghazal Kaviani
Yavuz Yarici
Seulgi Kim
Mohit Prabhushankar
Ghassan AlRegib
Mashhour Solh
Ameya Patil
54
0
0
24 Apr 2025
A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw
Wenwen Li
Chia-Yu Hsu
Sizhe Wang
Zhining Gu
Yili Yang
Brendan M. Rogers
A. Liljedahl
61
0
0
23 Apr 2025
Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis
Zhu Zhu
Shuo Jiang
Jingyuan Zheng
Yawen Li
Yifei Chen
Manli Zhao
Weizhong Gu
Feiwei Qin
Jinhu Wang
Gang Yu
MedIm
35
0
0
18 Apr 2025
Exploring Video-Based Driver Activity Recognition under Noisy Labels
Linjuan Fan
Di Wen
Kunyu Peng
Kailun Yang
J.N. Zhang
...
Yufan Chen
Junwei Zheng
Jiamin Wu
Xudong Han
Rainer Stiefelhagen
NoLa
49
0
0
16 Apr 2025
Action Anticipation from SoccerNet Football Video Broadcasts
Mohamad Dalal
Artur Xarles
A. Cioppa
Silvio Giancola
Marc Van Droogenbroeck
Bernard Ghanem
Albert Clapés
Sergio Escalera
T. Moeslund
AI4TS
36
0
0
16 Apr 2025
DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction
Kiana Hoshanfar
Alireza Hosseini
Ahmad Kalhor
Babak Nadjar Araabi
130
0
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
70
12
1
14 Apr 2025
Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning
Kyle Stein
A. Mahyari
Guillermo Francia III
Eman El-Sheikh
CLL
65
0
0
11 Apr 2025
Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset
Shiao Wang
Xihuai Wang
Bo Jiang
Lin Zhu
G. Li
Yali Wang
Yonghong Tian
Jin Tang
145
0
0
08 Apr 2025
Towards Generalizing Temporal Action Segmentation to Unseen Views
Emad Bahrami
Olga Zatsarynna
Gianpiero Francesca
Juergen Gall
EgoV
43
0
0
03 Apr 2025
SocialGesture: Delving into Multi-person Gesture Understanding
Xu Cao
Pranav Virupaksha
Wenqi Jia
Bolin Lai
Fiona Ryan
Sangmin Lee
James M. Rehg
SLR
56
0
0
03 Apr 2025
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Bernard Ghanem
59
0
0
01 Apr 2025
Multi-Task Learning for Extracting Menstrual Characteristics from Clinical Notes
Anna Shopova
Christoph Lippert
Leslee J. Shaw
Eugenia Alleva
47
0
0
31 Mar 2025
Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Junzhu Mao
Yang Shen
Jinyang Guo
Yazhou Yao
Xiansheng Hua
ViT
36
0
0
30 Mar 2025
OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition
Shihao Cheng
Jinlu Zhang
Yue Liu
Zhigang Tu
VLM
39
0
0
30 Mar 2025
Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation
Jonathan Attard
Dylan Seychell
48
0
0
27 Mar 2025
Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos
Jiaheng Zhou
Yanfeng Zhou
Wei Fang
Yuxing Tang
Le Lu
Ge Yang
Mamba
205
0
0
26 Mar 2025
Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings
Chengan Che
Chao Wang
Tom Vercauteren
Sophia Tsoka
Luis C. García-Peraza-Herrera
MedIm
46
0
0
25 Mar 2025
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Wencheng Zhu
Yuexin Wang
Hongxuan Li
Pengfei Zhu
Q. Hu
CLIP
48
0
0
24 Mar 2025
Beyond Accuracy: What Matters in Designing Well-Behaved Models?
Robin Hesse
Doğukan Bağcı
Bernt Schiele
Simone Schaub-Meyer
Stefan Roth
VLM
62
0
0
21 Mar 2025
Stitch-a-Recipe: Video Demonstration from Multistep Descriptions
Chi Hsuan Wu
Kumar Ashutosh
Kristen Grauman
DiffM
63
0
0
18 Mar 2025
Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition
Shristi Das Biswas
Efstathia Soufleri
Arani Roy
Kaushik Roy
59
0
0
17 Mar 2025
Towards Fast, Memory-based and Data-Efficient Vision-Language Policy
Haoxuan Li
Sixu Yan
Yongqian Li
Xinggang Wang
LM&Ro
64
0
0
13 Mar 2025
PromptGAR: Flexible Promptive Group Activity Recognition
Zhangyu Jin
Andrew Feng
Ankur Chemburkar
Celso M. De Melo
VLM
42
0
0
11 Mar 2025
Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup
Seokun Kang
Taehwan Kim
42
0
0
04 Mar 2025
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
Antoni Bigata
Michał Stypułkowski
Rodrigo Mira
Stella Bounareli
Konstantinos Vougioukas
Zoe Landgraf
Nikita Drobyshev
Maciej Ziȩba
Stavros Petridis
M. Pantic
DiffM
VGen
65
2
0
03 Mar 2025
An Efficient Approach to Detecting Lung Nodules Using Swin Transformer
Saeed Shakuri
Alireza Rezvanian
ViT
MedIm
45
1
0
03 Mar 2025
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
Otto Brookes
Maksim Kukushkin
Majid Mirmehdi
Colleen Stephens
Paula Dieguez
...
Lukas Boesch
Thomas Schmid
M. Arandjelovic
H. Kühl
T. Burghardt
48
0
0
28 Feb 2025
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
Shuming Liu
Chen Zhao
Fatimah Zohra
Mattia Soldan
Alejandro Pardo
...
Juan Carlos León Alcázar
A. Cioppa
Silvio Giancola
Carlos Hinojosa
Bernard Ghanem
68
3
0
27 Feb 2025
Hierarchical Context Transformer for Multi-level Semantic Scene Understanding
Luoying Hao
Yan Hu
Yang Yue
Li Wu
Huazhu Fu
Jinming Duan
Jiang Liu
68
0
0
24 Feb 2025
iFormer: Integrating ConvNet and Transformer for Mobile Application
Chuanyang Zheng
ViT
72
0
0
26 Jan 2025
MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha
Monish Soundar Raj
Pu Wang
Ahmed Helmy
Srijan Das
Mamba
53
3
0
10 Jan 2025
Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition
Mallika Garg
Debashis Ghosh
P. M. Pradhan
SLR
32
16
0
03 Jan 2025
Breaking the Context Bottleneck on Long Time Series Forecasting
Chao Ma
Yikai Hou
Xiang Li
Yinggang Sun
Haining Yu
Zhou Fang
Jiaxing Qu
AI4TS
72
0
0
21 Dec 2024
ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
Seungdong Yoa
Seungjun Lee
Hyeseung Cho
Bumsoo Kim
Woohyung Lim
ViT
70
0
0
21 Dec 2024
Training Strategies for Isolated Sign Language Recognition
Karina Kvanchiani
Roman Kraynov
Elizaveta Petrova
Petr Surovcev
Aleksandr Nagaev
A. Kapitanov
76
1
0
16 Dec 2024
Bridging the Divide: Reconsidering Softmax and Linear Attention
Dongchen Han
Yifan Pu
Zhuofan Xia
Yizeng Han
Xuran Pan
Xiu Li
Jiwen Lu
Shiji Song
Gao Huang
73
8
0
09 Dec 2024
OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
X. Zhang
Zecheng Tang
Zhipei Xu
Runyi Li
Youmin Xu
Bin Chen
Feng Gao
Jian Andrew Zhang
WIGM
93
4
0
02 Dec 2024
OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions
Guanyu Zhou
Xiaohan Yu
Wenxin Huang
Xuemei Jia
Xian Zhong
Chia-Wen Lin
CML
81
0
0
24 Nov 2024
Learning Collective Dynamics of Multi-Agent Systems using Event-based Vision
Minah Lee
Uday Kamal
Saibal Mukhopadhyay
25
0
0
11 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
26
10
0
07 Nov 2024
AM Flow: Adapters for Temporal Processing in Action Recognition
Tanay Agrawal
Abid Ali
A. Dantcheva
François Brémond
39
0
0
04 Nov 2024
HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation
Zirui Wang
Xinran Zhao
Simon Stepputtis
Woojun Kim
Tongshuang Wu
Katia P. Sycara
Yaqi Xie
OffRL
49
0
0
03 Nov 2024
Video Token Merging for Long-form Video Understanding
Seon-Ho Lee
Jue Wang
Zhikang Zhang
D. Fan
Xinyu Li
42
5
0
31 Oct 2024
Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
Manuel Benavent-Lledo
David Mulero-Pérez
David Ortiz-Perez
José García Rodríguez
Antonis Argyros
24
0
0
28 Oct 2024