ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2112.04446
  4. Cited By
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

8 December 2021
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
    ViT
ArXivPDFHTML

Papers citing "Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval"

21 / 71 papers shown
Title
Generalizing Multimodal Variational Methods to Sets
Generalizing Multimodal Variational Methods to Sets
Jinzhao Zhou
Yiqun Duan
Zhihong Chen
Yu-Cheng Chang
Chin-Teng Lin
DRL
45
0
0
19 Dec 2022
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
40
0
0
05 Dec 2022
Anticipative Feature Fusion Transformer for Multi-Modal Action
  Anticipation
Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation
Zeyun Zhong
David Schneider
Michael Voit
Rainer Stiefelhagen
Jürgen Beyerer
71
44
0
23 Oct 2022
Temporal Action Segmentation: An Analysis of Modern Techniques
Temporal Action Segmentation: An Analysis of Modern Techniques
Guodong Ding
Fadime Sener
Angela Yao
42
74
0
19 Oct 2022
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual
  Text-Video Retrieval
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko
Yung-Sung Chuang
Nina Shvetsova
Samuel Thomas
Rogerio Feris
Brian Kingsbury
Leonid Karlinsky
David F. Harwath
Hilde Kuehne
James R. Glass
VLM
28
4
0
07 Oct 2022
Unified Loss of Pair Similarity Optimization for Vision-Language
  Retrieval
Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval
Zheng Li
Caili Guo
Xin Wang
Zerun Feng
Jenq-Neng Hwang
Zhongtian Du
VLM
24
2
0
28 Sep 2022
Vision Transformers for Action Recognition: A Survey
Vision Transformers for Action Recognition: A Survey
Anwaar Ulhaq
Naveed Akhtar
Ganna Pogrebna
Ajmal Saeed Mian
ViT
19
44
0
13 Sep 2022
Fusion of Satellite Images and Weather Data with Transformer Networks
  for Downy Mildew Disease Detection
Fusion of Satellite Images and Weather Data with Transformer Networks for Downy Mildew Disease Detection
William Maillet
Maryam Ouhami
A. Hafiane
ViT
MedIm
16
6
0
06 Sep 2022
UAVM: Towards Unifying Audio and Visual Models
UAVM: Towards Unifying Audio and Visual Models
Yuan Gong
Alexander H. Liu
Andrew Rouditchenko
James R. Glass
27
20
0
29 Jul 2022
Radiomics-Guided Global-Local Transformer for Weakly Supervised
  Pathology Localization in Chest X-Rays
Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays
Yan Han
G. Holste
Ying Ding
Ahmed H. Tewfik
Yifan Peng
Zhangyang Wang
LM&MA
ViT
26
14
0
10 Jul 2022
Multimodal Learning with Transformers: A Survey
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
54
527
0
13 Jun 2022
A CLIP-Hitchhiker's Guide to Long Video Retrieval
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
CLIP
126
62
0
17 May 2022
Contrastive language and vision learning of general fashion concepts
Contrastive language and vision learning of general fashion concepts
P. Chia
Giuseppe Attanasio
Federico Bianchi
Silvia Terragni
A. Magalhães
Diogo Gonçalves
C. Greco
Jacopo Tagliabue
CLIP
18
42
0
08 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
41
39
0
06 Apr 2022
Learning Audio-Video Modalities from Image Captions
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
11
82
0
01 Apr 2022
VATT: Transformers for Multimodal Self-Supervised Learning from Raw
  Video, Audio and Text
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
248
577
0
22 Apr 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Xiaohan Wang
Linchao Zhu
Yi Yang
164
170
0
20 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip
  Retrieval
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
317
780
0
18 Apr 2021
Multi-modal Transformer for Video Retrieval
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
424
596
0
21 Jul 2020
Audiovisual SlowFast Networks for Video Recognition
Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao
Yong Jae Lee
Kristen Grauman
Jitendra Malik
Christoph Feichtenhofer
197
206
0
23 Jan 2020
Efficient Estimation of Word Representations in Vector Space
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
242
31,257
0
16 Jan 2013
Previous
12