ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2103.10699
  4. Cited By
MDMMT: Multidomain Multimodal Transformer for Video Retrieval

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

19 March 2021
Maksim Dzabraev
M. Kalashnikov
Stepan Alekseevich Komkov
Aleksandr Petiushko
ArXivPDFHTML

Papers citing "MDMMT: Multidomain Multimodal Transformer for Video Retrieval"

50 / 78 papers shown
Title
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
26
0
0
07 Apr 2025
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
33
0
0
02 Apr 2025
Explainable and Interpretable Multimodal Large Language Models: A
  Comprehensive Survey
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Yunkai Dang
Kaichen Huang
Jiahao Huo
Yibo Yan
S. Huang
...
Kun Wang
Yong Liu
Jing Shao
Hui Xiong
Xuming Hu
LRM
101
14
0
03 Dec 2024
SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via
  Contrast Learning for Multimodal Object Detection
SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection
Shuhan Dong
Yunsong Li
Weiying Xie
Jiaqing Zhang
Jiayuan Tian
Danian Yang
Jie Lei
31
1
0
15 Oct 2024
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Lianyu Hu
Tongkai Shi
Wei Feng
Fanhua Shang
Liang Wan
VLM
31
1
0
09 Oct 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of
  Modalities
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
32
0
0
17 Sep 2024
Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
Ni Wang
Dongliang Liao
Xing Xu
33
0
0
23 Jun 2024
An Empirical Study of Excitation and Aggregation Design Adaptions in
  CLIP4Clip for Video-Text Retrieval
An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
CLIP
29
1
0
25 May 2024
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding
  Space Binding
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding
Jiajie Teng
Huiyu Duan
Yucheng Zhu
Sijing Wu
Guangtao Zhai
29
2
0
15 May 2024
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Han Fang
Xianghao Zang
Chao Ban
Zerun Feng
Lanxiang Zhou
Zhongjiang He
Yongxiang Li
Hao Sun
29
1
0
18 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
23
5
0
12 Apr 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang
Guohao Sun
Pichao Wang
Dongfang Liu
S. Dianat
Majid Rabbani
Raghuveer M. Rao
Zhiqiang Tao
VGen
55
20
0
26 Mar 2024
Towards Zero-shot Human-Object Interaction Detection via Vision-Language
  Integration
Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration
Weiying Xue
Qi Liu
Qiwei Xiong
Yuxiao Wang
Zhenao Wei
Xiaofen Xing
Xiangmin Xu
VLM
36
3
0
12 Mar 2024
Multi-modal News Understanding with Professionally Labelled Videos
  (ReutersViLNews)
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)
Shih-Han Chou
Matthew Kowal
Yasmin Niknam
Diana Moyano
Shayaan Mehdi
...
Cheng Zhang
Ian Knopke
S. Kocak
Leonid Sigal
Yalda Mohsenzadeh
35
1
0
23 Jan 2024
Leveraging Generative Language Models for Weakly Supervised Sentence
  Component Analysis in Video-Language Joint Learning
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Rahul Pratap Singh
Bishmoy Paul
Ali Dabouei
Min Xu
17
1
0
10 Dec 2023
Detecting Any Human-Object Interaction Relationship: Universal HOI
  Detector with Spatial Prompt Learning on Foundation Models
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models
Yichao Cao
Qingfei Tang
Xiu Su
Chen Song
Shan You
Xiaobo Lu
Chang Xu
27
21
0
07 Nov 2023
Lost Your Style? Navigating with Semantic-Level Approach for
  Text-to-Outfit Retrieval
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval
Junkyu Jang
Eugene Hwang
Sung-Hyuk Park
22
0
0
03 Nov 2023
Encoding and Decoding Narratives: Datafication and Alternative Access
  Models for Audiovisual Archives
Encoding and Decoding Narratives: Datafication and Alternative Access Models for Audiovisual Archives
Yuchen Yang
33
1
0
10 Oct 2023
Write What You Want: Applying Text-to-video Retrieval to Audiovisual
  Archives
Write What You Want: Applying Text-to-video Retrieval to Audiovisual Archives
Yuchen Yang
VGen
19
7
0
09 Oct 2023
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal
  Retrieval
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Hao Li
Marie-Jeanne Lesot
Lianli Gao
Xiaosu Zhu
Christophe Marsala
EDL
16
11
0
29 Sep 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Mohit Bansal
95
44
0
18 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for
  Text-Video Retrieval
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
23
3
0
16 Sep 2023
Re-mine, Learn and Reason: Exploring the Cross-modal Semantic
  Correlations for Language-guided HOI detection
Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection
Yichao Cao
Qingfei Tang
Fengyuan Yang
Xiu Su
Shan You
Xiaobo Lu
Chang Xu
24
16
0
25 Jul 2023
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention
  and Zoom-in Boundary Detection
No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection
Qi Zhang
S. Zheng
Qin Jin
17
1
0
20 Jul 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text
  Retrieval
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
31
2
0
13 May 2023
Dialogue-to-Video Retrieval
Dialogue-to-Video Retrieval
Chenyang Lyu
Manh-Duy Nguyen
Van-Tu Ninh
Liting Zhou
C. Gurrin
Jennifer Foster
32
1
0
23 Mar 2023
CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft
CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft
Ziluo Ding
Hao Luo
Ke Li
Junpeng Yue
Tiejun Huang
Zongqing Lu
VLM
23
11
0
19 Mar 2023
Weakly-Supervised HOI Detection from Interaction Labels Only and
  Language/Vision-Language Priors
Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors
Mesut Erhan Unal
Adriana Kovashka
VLM
18
5
0
09 Mar 2023
Deep Learning for Video-Text Retrieval: a Review
Deep Learning for Video-Text Retrieval: a Review
Cunjuan Zhu
Qi Jia
Wei-Neng Chen
Yanming Guo
Yu Liu
24
14
0
24 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for
  Video-Language Pre-training
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
27
8
0
20 Feb 2023
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
Yimu Wang
Peng Shi
13
5
0
19 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
  Retrieval
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Yizhen Chen
Jie Wang
Lijian Lin
Zhongang Qi
Jin Ma
Ying Shan
VLM
21
18
0
30 Jan 2023
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
Manh-Duy Nguyen
Binh T. Nguyen
C. Gurrin
VLM
28
4
0
11 Jan 2023
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised
  Video Anomaly Detection
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
Kevin Hyekang Joo
Khoa T. Vo
Kashu Yamazaki
Ngan Le
19
38
0
09 Dec 2022
MIMO Is All You Need : A Strong Multi-In-Multi-Out Baseline for Video
  Prediction
MIMO Is All You Need : A Strong Multi-In-Multi-Out Baseline for Video Prediction
Shuliang Ning
Mengcheng Lan
Yanran Li
Chaofeng Chen
Qian Chen
Xunlai Chen
Xiaoguang Han
Shuguang Cui
30
20
0
09 Dec 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language
  Pre-training
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
28
15
0
21 Nov 2022
Multimodal Transformer for Parallel Concatenated Variational
  Autoencoders
Multimodal Transformer for Parallel Concatenated Variational Autoencoders
Stephen D. Liang
J. Mendel
ViT
19
5
0
28 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video
  Retrieval
RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
29
7
0
13 Oct 2022
Learning to Locate Visual Answer in Video Corpus Using Question
Learning to Locate Visual Answer in Video Corpus Using Question
Bin Li
Yixuan Weng
Bin Sun
Shutao Li
6
5
0
11 Oct 2022
MuMUR : Multilingual Multimodal Universal Retrieval
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
39
3
0
24 Aug 2022
M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
Shuo Liu
Weize Quan
Mingyuan Zhou
Sihong Chen
Jian Kang
Zhenlan Zhao
Chen Chen
Dong-Ming Yan
17
0
0
16 Aug 2022
Boosting Video-Text Retrieval with Explicit High-Level Semantics
Boosting Video-Text Retrieval with Explicit High-Level Semantics
Haoran Wang
Di Xu
Dongliang He
Fu Li
Zhong Ji
Jungong Han
Errui Ding
24
11
0
08 Aug 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Yuqi Liu
Pengfei Xiong
Luhui Xu
Shengming Cao
Qin Jin
25
113
0
16 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
  Retrieval
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
15
268
0
15 Jul 2022
Learning to Retrieve Videos by Asking Questions
Learning to Retrieve Videos by Asking Questions
Avinash Madasu
Junier Oliva
Gedas Bertasius
VGen
30
16
0
11 May 2022
Attract me to Buy: Advertisement Copywriting Generation with Multimodal
  Multi-structured Information
Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information
Zhipeng Zhang
Xinglin Hou
K. Niu
Zhongzhen Huang
T. Ge
Yuning Jiang
Qi Wu
Peifeng Wang
26
4
0
07 May 2022
Learn to Understand Negation in Video Retrieval
Learn to Understand Negation in Video Retrieval
Ziyue Wang
Aozhu Chen
Fan Hu
Xirong Li
SSL
11
12
0
30 Apr 2022
Relevance-based Margin for Contrastively-trained Video Retrieval Models
Relevance-based Margin for Contrastively-trained Video Retrieval Models
Alex Falcon
Swathikiran Sudhakaran
G. Serra
Sergio Escalera
O. Lanz
34
7
0
27 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with
  Multi-Level Representations
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
20
18
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
41
39
0
06 Apr 2022
12
Next