ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2403.09626
  4. Cited By
Video Mamba Suite: State Space Model as a Versatile Alternative for
  Video Understanding

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

14 March 2024
Guo Chen
Yifei Huang
Jilan Xu
Baoqi Pei
Zhe Chen
Zhiqi Li
Jiahao Wang
Kunchang Li
Tong Lu
Limin Wang
    Mamba
ArXivPDFHTML

Papers citing "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding"

50 / 50 papers shown
Title
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval
Haoran Tang
Meng Cao
Jinfa Huang
Ruyang Liu
Peng Jin
Ge Li
Xiaodan Liang
Mamba
143
4
0
24 Feb 2025
Linear Attention Modeling for Learned Image Compression
Linear Attention Modeling for Learned Image Compression
Donghui Feng
Zhengxue Cheng
Shen Wang
Ronghua Wu
Hongwei Hu
Guo Lu
Li Song
287
1
0
09 Feb 2025
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha
Monish Soundar Raj
Pu Wang
Ahmed Helmy
Srijan Das
Mamba
95
3
0
10 Jan 2025
VMamba: Visual State Space Model
VMamba: Visual State Space Model
Yue Liu
Yunjie Tian
Yuzhong Zhao
Hongtian Yu
Lingxi Xie
Yaowei Wang
Qixiang Ye
Jianbin Jiao
Yunfan Liu
Mamba
241
683
0
31 Dec 2024
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Yuanmin Huang
Jilan Xu
Baoqi Pei
Yuping He
Guo Chen
...
Kunpeng Li
C. Yuan
Yidan Wang
Yu Qiao
L. Wang
114
6
0
31 Dec 2024
Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion
Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion
Chaodong Xiao
Minghan Li
Zhengqiang Zhang
Deyu Meng
Lei Zhang
Mamba
116
5
0
19 Oct 2024
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Yuchen Duan
Weiyun Wang
Zhe Chen
Xizhou Zhu
Lewei Lu
Tong Lu
Yu Qiao
Hongsheng Li
Jifeng Dai
Wenhai Wang
ViT
60
45
0
04 Mar 2024
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language
  Understanding
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
77
287
0
17 Aug 2023
VideoLLM: Modeling Video Sequence with Large Language Models
VideoLLM: Modeling Video Sequence with Large Language Models
Guo Chen
Yin-Dong Zheng
Jiahao Wang
Jilan Xu
Yifei Huang
...
Yi Wang
Yali Wang
Yu Qiao
Tong Lu
Limin Wang
MLLM
118
83
0
22 May 2023
RWKV: Reinventing RNNs for the Transformer Era
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng
Eric Alcaide
Quentin G. Anthony
Alon Albalak
Samuel Arcadinho
...
Qihang Zhao
P. Zhou
Qinghua Zhou
Jian Zhu
Rui-Jie Zhu
198
593
0
22 May 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
93
166
0
28 Mar 2023
InternVideo: General Video Foundation Models via Generative and
  Discriminative Learning
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
104
325
0
06 Dec 2022
Long Range Language Modeling via Gated State Spaces
Long Range Language Modeling via Gated State Spaces
Harsh Mehta
Ankit Gupta
Ashok Cutkosky
Behnam Neyshabur
Mamba
76
238
0
27 Jun 2022
On the Parameterization and Initialization of Diagonal State Space
  Models
On the Parameterization and Initialization of Diagonal State Space Models
Albert Gu
Ankit Gupta
Karan Goel
Christopher Ré
71
314
0
23 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language
  Models
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
127
236
0
16 Jun 2022
ActionFormer: Localizing Moments of Actions with Transformers
ActionFormer: Localizing Moments of Actions with Transformers
Chen-Da Liu-Zhang
Jianxin Wu
Yin Li
ViT
59
340
0
16 Feb 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal
  Representation Learning
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
Kunchang Li
Yali Wang
Peng Gao
Guanglu Song
Yu Liu
Hongsheng Li
Yu Qiao
ViT
114
249
0
12 Jan 2022
Masked Autoencoders Are Scalable Vision Learners
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
427
7,705
0
11 Nov 2021
ASFormer: Transformer for Action Segmentation
ASFormer: Transformer for Action Segmentation
Fangqiu Yi
Hongyu Wen
Tingting Jiang
ViT
108
176
0
16 Oct 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
363
1,081
0
13 Oct 2021
End-to-End Dense Video Captioning with Parallel Decoding
End-to-End Dense Video Captioning with Parallel Decoding
Teng Wang
Ruimao Zhang
Zhichao Lu
Feng Zheng
Ran Cheng
Ping Luo
3DV
69
183
0
17 Aug 2021
XCiT: Cross-Covariance Image Transformers
XCiT: Cross-Covariance Image Transformers
Alaaeldin El-Nouby
Hugo Touvron
Mathilde Caron
Piotr Bojanowski
Matthijs Douze
...
Ivan Laptev
Natalia Neverova
Gabriel Synnaeve
Jakob Verbeek
Hervé Jégou
ViT
131
510
0
17 Jun 2021
BEiT: BERT Pre-Training of Image Transformers
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao
Li Dong
Songhao Piao
Furu Wei
ViT
223
2,812
0
15 Jun 2021
Multiscale Vision Transformers
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
127
1,257
0
22 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
133
1,172
0
01 Apr 2021
ViViT: A Video Vision Transformer
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
193
2,137
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
404
21,347
0
25 Mar 2021
An Image is Worth 16x16 Words, What is a Video Worth?
An Image is Worth 16x16 Words, What is a Video Worth?
Gilad Sharir
Asaf Noy
Lihi Zelnik-Manor
ViT
62
125
0
25 Mar 2021
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction
  without Convolutions
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang
Enze Xie
Xiang Li
Deng-Ping Fan
Kaitao Song
Ding Liang
Tong Lu
Ping Luo
Ling Shao
ViT
491
3,709
0
24 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
362
2,039
0
09 Feb 2021
Training data-efficient image transformers & distillation through
  attention
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
347
6,731
0
23 Dec 2020
TDN: Temporal Difference Networks for Efficient Action Recognition
TDN: Temporal Difference Networks for Efficient Action Recognition
Limin Wang
Zhan Tong
Bin Ji
Gangshan Wu
75
396
0
18 Dec 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
194
5,046
0
08 Oct 2020
Rethinking Attention with Performers
Rethinking Attention with Performers
K. Choromanski
Valerii Likhosherstov
David Dohan
Xingyou Song
Andreea Gane
...
Afroz Mohiuddin
Lukasz Kaiser
David Belanger
Lucy J. Colwell
Adrian Weller
167
1,570
0
30 Sep 2020
Rescaling Egocentric Vision
Rescaling Egocentric Vision
Dima Damen
Hazel Doughty
G. Farinella
Antonino Furnari
Evangelos Kazakos
...
Davide Moltisanti
Jonathan Munro
Toby Perrett
Will Price
Michael Wray
EgoV
52
454
0
23 Jun 2020
X3D: Expanding Architectures for Efficient Video Recognition
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
125
1,018
0
09 Apr 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
371
20,053
0
23 Oct 2019
Video Classification with Channel-Separated Convolutional Networks
Video Classification with Channel-Separated Convolutional Networks
Du Tran
Heng Wang
Lorenzo Torresani
Matt Feiszli
3DV
61
586
0
04 Apr 2019
SlowFast Networks for Video Recognition
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
162
3,262
0
10 Dec 2018
TSM: Temporal Shift Module for Efficient Video Understanding
TSM: Temporal Shift Module for Efficient Video Understanding
Ji Lin
Chuang Gan
Song Han
85
1,683
0
20 Nov 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in
  Video Classification
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Patrick Murphy
3DH
137
1,325
0
13 Dec 2017
ConvNet Architecture Search for Spatiotemporal Feature Learning
ConvNet Architecture Search for Spatiotemporal Feature Learning
Du Tran
Jamie Ray
Zheng Shou
Shih-Fu Chang
Manohar Paluri
3DPC
72
383
0
16 Aug 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
219
7,989
0
22 May 2017
Temporal Segment Networks for Action Recognition in Videos
Temporal Segment Networks for Action Recognition in Videos
Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
Dahua Lin
Xiaoou Tang
Luc Van Gool
ViT
110
809
0
08 May 2017
TALL: Temporal Activity Localization via Language Query
TALL: Temporal Activity Localization via Language Query
J. Gao
Chen Sun
Zhenheng Yang
Ram Nevatia
120
819
0
05 May 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
72
825
0
28 Mar 2017
Aggregated Residual Transformations for Deep Neural Networks
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie
Ross B. Girshick
Piotr Dollár
Zhuowen Tu
Kaiming He
483
10,305
0
16 Nov 2016
The THUMOS Challenge on Action Recognition for Videos "in the Wild"
The THUMOS Challenge on Action Recognition for Videos "in the Wild"
Haroon Idrees
Amir Zamir
Yu-Gang Jiang
Alexander N. Gorban
Ivan Laptev
Rahul Sukthankar
M. Shah
76
775
0
21 Apr 2016
Batch Normalization: Accelerating Deep Network Training by Reducing
  Internal Covariate Shift
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe
Christian Szegedy
OOD
421
43,234
0
11 Feb 2015
CIDEr: Consensus-based Image Description Evaluation
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
256
4,471
0
20 Nov 2014
1