2206.12845
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
26 June 2022
Burak Satar
Erik Cambria
Hanwang Zhang
J. Lim
Papers citing "RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval"
15 / 15 papers shown
Title
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
79
0
0
02 Apr 2025
Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Longrong Yang
Dong Shen
Chaoxiang Cai
Fan Yang
Size Li
Tingting Gao
Xi Li
MoE
87
2
0
28 Jun 2024
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
50
80
0
01 Dec 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM
MLLM
MoE
61
543
0
03 Nov 2021
Masking Modalities for Cross-modal Video Retrieval
Valentin Gabeur
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
34
30
0
01 Nov 2021
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
103
419
0
14 Nov 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
504
602
0
21 Jul 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
...
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
SSL
55
142
0
16 Jun 2020
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Shizhe Chen
Yida Zhao
Qin Jin
Qi Wu
74
311
0
01 Mar 2020
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
Yale Song
M. Soleymani
47
242
0
11 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
91
1,192
0
07 Jun 2019
Cross-Modal and Hierarchical Modeling of Video and Text
Bowen Zhang
Hexiang Hu
Fei Sha
BDL
AI4TS
39
189
0
16 Oct 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
45
234
0
07 Apr 2018
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
206
7,961
0
22 May 2017
Modeling Relational Data with Graph Convolutional Networks
Michael Schlichtkrull
Thomas Kipf
Peter Bloem
Rianne van den Berg
Ivan Titov
Max Welling
GNN
150
4,772
0
17 Mar 2017