ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2408.10575
  4. Cited By
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

24 February 2025
Haoran Tang
Meng Cao
Jinfa Huang
Ruyang Liu
Peng Jin
Ge Li
Xiaodan Liang
    Mamba
ArXiv (abs)PDFHTML

Papers citing "MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval"

50 / 50 papers shown
Title
MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
Saron Samuel
Dan DeGenaro
Jimena Guallar-Blasco
Kate Sanders
Oluwaseun Eisape
...
David Etter
Efsun Kayi
Matthew Wiesner
Kenton W. Murray
Reno Kriz
185
0
0
26 Mar 2025
TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
Xiuwei Chen
Sihao Lin
Xiao Dong
Zhenpeng Chen
Meng Cao
Jiawei Han
Hang Xu
Xiaodan Liang
Mamba
124
1
0
24 Feb 2025
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha
Monish Soundar Raj
Pu Wang
Ahmed Helmy
Srijan Das
Mamba
158
3
0
10 Jan 2025
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
Meng Cao
Haoran Tang
Haoze Zhao
Hangyu Guo
Jing Liu
Ge Zhang
Ruyang Liu
Qiang Sun
Ian Reid
Xiaodan Liang
219
3
0
02 Dec 2024
Uncertainty-aware sign language video retrieval with probability
  distribution modeling
Uncertainty-aware sign language video retrieval with probability distribution modeling
Xuan Wu
Hongxiang Li
Yuanjiang Luo
Xuxin Cheng
Xianwei Zhuang
Meng Cao
Keren Fu
UQLMSLR
69
6
0
30 May 2024
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao
Haoran Tang
Jinfa Huang
Peng Jin
Can Zhang
Ruyang Liu
Long Chen
Xiaodan Liang
Li-ming Yuan
Ge Li
169
14
0
29 May 2024
MambaOut: Do We Really Need Mamba for Vision?
MambaOut: Do We Really Need Mamba for Vision?
Weihao Yu
Xinchao Wang
Mamba
99
60
0
13 May 2024
ST-LLM: Large Language Models Are Effective Temporal Learners
ST-LLM: Large Language Models Are Effective Temporal Learners
Ruyang Liu
Chen Li
Haoran Tang
Yixiao Ge
Ying Shan
Ge Li
107
82
0
30 Mar 2024
Gamba: Marry Gaussian Splatting with Mamba for single view 3D
  reconstruction
Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction
Qiuhong Shen
Zike Wu
Xuanyu Yi
Pan Zhou
Hanwang Zhang
Shuicheng Yan
Xinchao Wang
3DGS
103
44
0
27 Mar 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang
Guohao Sun
Pichao Wang
Dongfang Liu
S. Dianat
Majid Rabbani
Raghuveer M. Rao
Zhiqiang Tao
VGen
124
26
0
26 Mar 2024
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition
Chenhongyi Yang
Zehui Chen
Miguel Espinosa
Linus Ericsson
Zhenyu Wang
Jiaming Liu
Elliot J. Crowley
Mamba
123
99
0
26 Mar 2024
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
Xiaohuan Pei
Tao Huang
Chang Xu
Mamba
98
101
0
15 Mar 2024
Video Mamba Suite: State Space Model as a Versatile Alternative for
  Video Understanding
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Guo Chen
Yifei Huang
Jilan Xu
Baoqi Pei
Zhe Chen
Zhiqi Li
Jiahao Wang
Kunchang Li
Tong Lu
Limin Wang
Mamba
143
78
0
14 Mar 2024
VideoMamba: State Space Model for Efficient Video Understanding
VideoMamba: State Space Model for Efficient Video Understanding
Kunchang Li
Xinhao Li
Yi Wang
Yinan He
Yali Wang
Limin Wang
Yu Qiao
Mamba
69
214
0
11 Mar 2024
Point Mamba: A Novel Point Cloud Backbone Based on State Space Model
  with Octree-Based Ordering Strategy
Point Mamba: A Novel Point Cloud Backbone Based on State Space Model with Octree-Based Ordering Strategy
Jiuming Liu
Ruiji Yu
Yian Wang
Yu Zheng
Tianchen Deng
Weicai Ye
Hesheng Wang
107
48
0
11 Mar 2024
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu
Tri Dao
Mamba
168
2,855
0
01 Dec 2023
Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General
  Healthcare
Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare
Junling Liu
Ziming Wang
Qichen Ye
Dading Chong
Peilin Zhou
Yining Hua
VLMLM&MA
80
54
0
27 Oct 2023
Video Referring Expression Comprehension via Transformer with
  Content-conditioned Query
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Jiang Ji
Meng Cao
Tengtao Song
Long Chen
Yi Wang
Yuexian Zou
96
6
0
25 Oct 2023
BT-Adapter: Video Conversation is Feasible Without Video Instruction
  Tuning
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Ruyang Liu
Chen Li
Yixiao Ge
Ying Shan
Thomas H. Li
Ge Li
116
30
0
27 Sep 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Joey Tianyi Zhou
214
49
0
18 Sep 2023
Improving Reference-based Distinctive Image Captioning with Contrastive
  Rewards
Improving Reference-based Distinctive Image Captioning with Contrastive Rewards
Yangjun Mao
Jun Xiao
Dong Zhang
Meng Cao
Jian Shao
Yueting Zhuang
Long Chen
EGVM
76
9
0
25 Jun 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set
  Alignment
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Peng Jin
Hao Li
Ze-Long Cheng
Jinfa Huang
Zhennan Wang
Li-ming Yuan
Chang-rui Liu
Jie Chen
117
37
0
20 May 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
136
169
0
28 Mar 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for
  Cross-Modal Representation Learning
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Peng Jin
Jinfa Huang
Pengfei Xiong
Shangxuan Tian
Chang-rui Liu
Xiang Ji
Li-ming Yuan
Jie Chen
101
59
0
25 Mar 2023
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Peng Jin
Hao Li
Ze-Long Cheng
Kehan Li
Xiang Ji
Chang-rui Liu
Li-ming Yuan
Jie Chen
DiffMVGen
93
58
0
17 Mar 2023
Improving Video Retrieval by Adaptive Margin
Improving Video Retrieval by Adaptive Margin
Feng He
Qi Wang
Zhifan Feng
Wenbin Jiang
Yajuan Lü
Yong Zhu
Xiao Tan
137
22
0
09 Mar 2023
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
  Transferring
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
Ruyang Liu
Jingjia Huang
Ge Li
Jiashi Feng
Xing Wu
Thomas H. Li
AI4TSCLIPVLM
112
48
0
26 Jan 2023
Exploiting Auxiliary Caption for Video Grounding
Exploiting Auxiliary Caption for Video Grounding
Hongxiang Li
Meng Cao
Xuxin Cheng
Zhihong Zhu
Yaowei Li
Yuexian Zou
74
10
0
15 Jan 2023
Expectation-Maximization Contrastive Learning for Compact
  Video-and-Language Representations
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Peng Jin
Jinfa Huang
Fenglin Liu
Xian Wu
Shen Ge
Guoli Song
David Clifton
Jing Chen
VLM
112
69
0
21 Nov 2022
Correspondence Matters for Video Referring Expression Comprehension
Correspondence Matters for Video Referring Expression Comprehension
Meng Cao
Ji Jiang
Long Chen
Yuexian Zou
VOS
84
20
0
21 Jul 2022
LocVTP: Video-Text Pre-training for Temporal Localization
LocVTP: Video-Text Pre-training for Temporal Localization
Meng Cao
Tianyu Yang
Junwu Weng
Can Zhang
Jue Wang
Yuexian Zou
92
65
0
21 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
  Retrieval
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIPVLM
111
293
0
15 Jul 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with
  IO-Awareness
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao
Daniel Y. Fu
Stefano Ermon
Atri Rudra
Christopher Ré
VLM
448
2,299
0
27 May 2022
Vision Transformer Adapter for Dense Predictions
Vision Transformer Adapter for Dense Predictions
Zhe Chen
Yuchen Duan
Wenhai Wang
Junjun He
Tong Lu
Jifeng Dai
Yu Qiao
182
572
0
17 May 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with
  Multi-Level Representations
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
124
20
0
07 Apr 2022
Exploring Plain Vision Transformer Backbones for Object Detection
Exploring Plain Vision Transformer Backbones for Object Detection
Yanghao Li
Hanzi Mao
Ross B. Girshick
Kaiming He
ViT
108
823
0
30 Mar 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
101
162
0
28 Mar 2022
Disentangled Representation Learning for Text-Video Retrieval
Disentangled Representation Learning for Text-Video Retrieval
Qiang Wang
Yanhao Zhang
Yun Zheng
Pan Pan
Xiansheng Hua
82
81
0
14 Mar 2022
On Pursuit of Designing Multi-modal Transformer for Video Grounding
On Pursuit of Designing Multi-modal Transformer for Video Grounding
Meng Cao
Long Chen
Mike Zheng Shou
Can Zhang
Yuexian Zou
86
81
0
13 Sep 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
540
21,854
0
25 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIPVLM
1.1K
30,116
0
26 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse
  Sampling
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
197
666
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
442
2,080
0
09 Feb 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
795
41,946
0
22 Oct 2020
Multi-modal Transformer for Video Retrieval
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
608
612
0
21 Jul 2020
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Jie Lei
Licheng Yu
Tamara L. Berg
Joey Tianyi Zhou
228
288
0
24 Jan 2020
SlowFast Networks for Video Recognition
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
217
3,302
0
10 Dec 2018
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
Youngjae Yu
Jongseok Kim
Gunhee Kim
108
347
0
07 Aug 2018
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
1.0K
133,589
0
12 Jun 2017
Dense-Captioning Events in Videos
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
211
1,259
0
02 May 2017
1