2007.10639
Multi-modal Transformer for Video Retrieval
21 July 2020
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
Papers citing "Multi-modal Transformer for Video Retrieval"
50 / 133 papers shown
Title
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
46
11
0
10 Oct 2022
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li
Caili Guo
Xin Eric Wang
Zerun Feng
Jenq-Neng Hwang
Zhongtian Du
VLM
24
2
0
28 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
25
68
0
14 Sep 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
21
63
0
04 Sep 2022
Partially Relevant Video Retrieval
Jianfeng Dong
Xianke Chen
Minsong Zhang
Xun Yang
Shujie Chen
Xirong Li
Xun Wang
14
39
0
26 Aug 2022
M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
Shuo Liu
Weize Quan
Mingyuan Zhou
Sihong Chen
Jian Kang
Zhenlan Zhao
Chen Chen
Dong-Ming Yan
20
0
0
16 Aug 2022
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Alex Falcon
G. Serra
O. Lanz
VGen
34
25
0
03 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
20
18
0
01 Aug 2022
Temporal and cross-modal attention for audio-visual zero-shot learning
Otniel-Bogdan Mercea
Thomas Hummel
A. Sophia Koepke
Zeynep Akata
35
25
0
20 Jul 2022
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Jingjia Huang
Yinan Li
Jiashi Feng
Xinglong Wu
Xiaoshuai Sun
Rongrong Ji
VLM
19
48
0
16 Jul 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Yuqi Liu
Pengfei Xiong
Luhui Xu
Shengming Cao
Qin Jin
30
113
0
16 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
17
268
0
15 Jul 2022
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Burak Satar
Hongyuan Zhu
Hanwang Zhang
J. Lim
24
3
0
29 Jun 2022
Context-Aware Transformers For Spinal Cancer Detection and Radiological Grading
Rhydian Windsor
A. Jamaludin
T. Kadir
Andrew Zisserman
ViT
MedIm
8
13
0
27 Jun 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Y. S. Rawat
M. Shah
SSL
34
131
0
18 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
20
81
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
50
525
0
13 Jun 2022
Learning Speaker-specific Lip-to-Speech Generation
Munender Varshney
Ravindra Yadav
Vinay P. Namboodiri
R. Hegde
16
7
0
04 Jun 2022
VRAG: Region Attention Graphs for Content-Based Video Retrieval
K. Ng
Ser-Nam Lim
G. Lee
27
4
0
18 May 2022
Learning to Retrieve Videos by Asking Questions
Avinash Madasu
Junier Oliva
Gedas Bertasius
VGen
30
16
0
11 May 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
34
33
0
10 May 2022
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
Henghui Zhao
Isma Hadji
Nikita Dvornik
Konstantinos G. Derpanis
Richard P. Wildes
Allan D. Jepson
26
45
0
04 May 2022
Relevance-based Margin for Contrastively-trained Video Retrieval Models
Alex Falcon
Swathikiran Sudhakaran
G. Serra
Sergio Escalera
O. Lanz
37
7
0
27 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
13
43
0
26 Apr 2022
Modality-Balanced Embedding for Video Retrieval
Xun Wang
Bingqing Ke
Xuanping Li
Fangyu Liu
Mingyu Zhang
Xiao Liang
Qi-En Xiao
Cheng Luo
Yue Yu
19
9
0
18 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
19
54
0
15 Apr 2022
Are Multimodal Transformers Robust to Missing Modality?
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
26
147
0
12 Apr 2022
Probabilistic Representations for Video Contrastive Learning
Jungin Park
Jiyoung Lee
Ig-Jae Kim
K. Sohn
SSL
26
43
0
08 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
20
18
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
41
39
0
06 Apr 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
28
94
0
30 Mar 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
31
149
0
28 Mar 2022
How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
Hazel Doughty
Cees G. M. Snoek
25
19
0
23 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
29
74
0
18 Mar 2022
Disentangled Representation Learning for Text-Video Retrieval
Qiang Wang
Yanhao Zhang
Yun Zheng
Pan Pan
Xiansheng Hua
45
76
0
14 Mar 2022
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
Alexander Kunitsyn
M. Kalashnikov
Maksim Dzabraev
Andrei Ivaniuta
28
16
0
14 Mar 2022
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
Jing Tan
Yuhong Wang
Gangshan Wu
Limin Wang
43
14
0
01 Mar 2022
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
Kunchang Li
Yali Wang
Junhao Zhang
Peng Gao
Guanglu Song
Yu Liu
Hongsheng Li
Yu Qiao
ViT
150
361
0
24 Jan 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
27
164
0
20 Jan 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
22
103
0
16 Jan 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
24
108
0
13 Jan 2022
Sign Language Video Retrieval with Free-Form Textual Queries
A. Duarte
Samuel Albanie
Xavier Giró-i-Nieto
Gül Varol
SLR
38
29
0
07 Jan 2022
Cross Modal Retrieval with Querybank Normalisation
Simion-Vlad Bogolin
Ioana Croitoru
Hailin Jin
Yang Liu
Samuel Albanie
27
84
0
23 Dec 2021
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Dongxu Li
Junnan Li
Hongdong Li
Juan Carlos Niebles
S. Hoi
22
191
0
17 Dec 2021
TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning
Shiming Chen
Zi-Quan Hong
Wenjin Hou
Guosen Xie
Yibing Song
Jian-jun Zhao
Xinge You
Shuicheng Yan
Ling Shao
ViT
17
44
0
16 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
38
686
0
08 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
31
129
0
08 Dec 2021
TransZero: Attribute-guided Transformer for Zero-Shot Learning
Shiming Chen
Ziming Hong
Yang Liu
Guosen Xie
Baigui Sun
Hao Li
Qinmu Peng
Kelvin Lu
Xinge You
ViT
42
131
0
03 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
30
23
0
02 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
17
79
0
01 Dec 2021