ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.00650
  4. Cited By
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

1 April 2021
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
    VGen
ArXivPDFHTML

Papers citing "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval"

50 / 282 papers shown
Title
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
Kate Sanders
David Etter
Reno Kriz
Benjamin Van Durme
VGen
42
7
0
06 Jul 2023
Valley: Video Assistant with Large Language model Enhanced abilitY
Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo
Ziwang Zhao
Min Yang
Junwei Dong
Da Li
Pengcheng Lu
Tao Wang
Linmei Hu
Ming-Hui Qiu
MLLM
57
191
0
12 Jun 2023
VideoComposer: Compositional Video Synthesis with Motion Controllability
VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang
Hangjie Yuan
Shiwei Zhang
Dayou Chen
Jiuniu Wang
Yingya Zhang
Yujun Shen
Deli Zhao
Jingren Zhou
VGen
DiffM
33
319
0
03 Jun 2023
Gen-L-Video: Multi-Text to Long Video Generation via Temporal
  Co-Denoising
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising
Fu Lee Wang
Wenshuo Chen
Guanglu Song
Han-Jia Ye
Yu Liu
Hongsheng Li
VGen
DiffM
50
89
0
29 May 2023
ChatBridge: Bridging Modalities with Large Language Model as a Language
  Catalyst
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Zijia Zhao
Longteng Guo
Tongtian Yue
Si-Qing Chen
Shuai Shao
Xinxin Zhu
Zehuan Yuan
Jing Liu
MLLM
40
53
0
25 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
  Scale
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
24
9
0
23 May 2023
EDIS: Entity-Driven Image Search over Multimodal Web Content
EDIS: Entity-Driven Image Search over Multimodal Web Content
Siqi Liu
Weixi Feng
Tsu-Jui Fu
Wenhu Chen
Wei Wang
VLM
48
9
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and
  Blending
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
23
17
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language,
  and Speech Data
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
17
2
0
21 May 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set
  Alignment
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Peng Jin
Hao Li
Ze-Long Cheng
Jinfa Huang
Zhennan Wang
Li-ming Yuan
Chang-rui Liu
Jie Chen
38
32
0
20 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner
  and Dense Captioner
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
15
1
0
19 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language
  Models in the Low-data Regime
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
47
2
0
03 May 2023
Align your Latents: High-Resolution Video Synthesis with Latent
  Diffusion Models
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
A. Blattmann
Robin Rombach
Huan Ling
Tim Dockhorn
Seung Wook Kim
Sanja Fidler
Karsten Kreis
3DGS
VGen
127
1,019
0
18 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
31
12
0
18 Apr 2023
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient
  Text-to-Video Generation
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Jie An
Songyang Zhang
Harry Yang
Sonal Gupta
Jia-Bin Huang
Jiebo Luo
Xiaoyue Yin
DiffM
VGen
38
107
0
17 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
36
37
0
09 Apr 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
42
20
0
05 Apr 2023
AutoAD: Movie Description in Context
AutoAD: Movie Description in Context
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
24
34
0
29 Mar 2023
Hierarchical Video-Moment Retrieval and Step-Captioning
Hierarchical Video-Moment Retrieval and Step-Captioning
Abhaysinh Zala
Jaemin Cho
Satwik Kottur
Xilun Chen
Barlas Ouguz
Yasher Mehdad
Joey Tianyi Zhou
3DV
20
51
0
29 Mar 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
57
156
0
28 Mar 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for
  Cross-Modal Representation Learning
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Peng Jin
Jinfa Huang
Pengfei Xiong
Shangxuan Tian
Chang-rui Liu
Xiang Ji
Li-ming Yuan
Jie Chen
45
50
0
25 Mar 2023
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Jiahao Zhang
A. Cherian
Yanbin Liu
Yizhak Ben-Shabat
Cristian Rodriguez-Opazo
Stephen Gould
32
8
0
24 Mar 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning
Text with Knowledge Graph Augmented Transformer for Video Captioning
Xin Gu
G. Chen
Yufei Wang
Libo Zhang
Tiejian Luo
Longyin Wen
32
47
0
22 Mar 2023
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Peng Jin
Hao Li
Ze-Long Cheng
Kehan Li
Xiang Ji
Chang-rui Liu
Li-ming Yuan
Jie Chen
DiffM
VGen
28
54
0
17 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for
  Video-Language Pre-training
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
32
8
0
20 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
36
7
0
16 Feb 2023
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale
  Image-Text Retrieval
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval
Ziyang Luo
Pu Zhao
Can Xu
Xiubo Geng
Tao Shen
Chongyang Tao
Jing Ma
Qingwen Lin
Daxin Jiang
VLM
CLIP
29
3
0
06 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
  Retrieval
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Yizhen Chen
Jie Wang
Lijian Lin
Zhongang Qi
Jin Ma
Ying Shan
VLM
33
18
0
30 Jan 2023
Temporal Perceiving Video-Language Pre-training
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
32
15
0
18 Jan 2023
UATVR: Uncertainty-Adaptive Text-Video Retrieval
UATVR: Uncertainty-Adaptive Text-Video Retrieval
Bo Fang
Wenhao Wu
Chang-rui Liu
Yu Zhou
Yuxin Song
Weiping Wang
Min Yang
Xiang Ji
Jingdong Wang
26
46
0
16 Jan 2023
HierVL: Learning Hierarchical Video-Language Embeddings
HierVL: Learning Hierarchical Video-Language Embeddings
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
VLM
AI4TS
28
53
0
05 Jan 2023
What You Say Is What You Show: Visual Narration Detection in
  Instructional Videos
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
29
4
0
05 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
86
36
0
05 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
Learning Trajectory-Word Alignments for Video-Language Tasks
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
36
7
0
05 Jan 2023
Ego-Only: Egocentric Action Detection without Exocentric Transferring
Ego-Only: Egocentric Action Detection without Exocentric Transferring
Huiyu Wang
Mitesh Singh
Lorenzo Torresani
EgoV
77
24
0
03 Jan 2023
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Santhosh Kumar Ramakrishnan
Ziad Al-Halah
Kristen Grauman
119
39
0
02 Jan 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
  Text-to-Video Generation
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu
Yixiao Ge
Xintao Wang
Weixian Lei
Yuchao Gu
Yufei Shi
W. Hsu
Ying Shan
Xiaohu Qie
Mike Zheng Shou
VGen
64
694
0
22 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
24
37
0
19 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video
  Learning
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
38
54
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and
  Discriminative Learning
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
57
311
0
06 Dec 2022
AbHE: All Attention-based Homography Estimation
AbHE: All Attention-based Homography Estimation
Mingxiao Huo
Zhihao Zhang
Xinyang Ren
Xianqiang Yang
29
6
0
06 Dec 2022
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
50
0
0
05 Dec 2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu
Biaolong Chen
Yue Liao
Shuwen Xiao
Wenyu Sun
Xiaobo Li
Yousong Zhu
Jinqiao Wang
Si Liu
CLIP
35
11
0
02 Dec 2022
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim
Nam-Won Kim
Suha Kwak
26
37
0
30 Nov 2022
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary
  Semantic Segmentation
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
Huaishao Luo
Junwei Bao
Youzheng Wu
Xiaodong He
Tianrui Li
VLM
32
146
0
27 Nov 2022
Tell Me What Happened: Unifying Text-guided Video Completion via
  Multimodal Masked Video Generation
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Tsu-Jui Fu
Licheng Yu
Ning Zhang
Cheng-Yang Fu
Jong-Chyi Su
William Yang Wang
Sean Bell
VGen
61
37
0
23 Nov 2022
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
Siteng Huang
Biao Gong
Yulin Pan
Jianwen Jiang
Yiliang Lv
Yuyuan Li
Donglin Wang
VLM
VPVLM
22
41
0
23 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative
  Latent Attention
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
24
9
0
21 Nov 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language
  Pre-training
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
34
15
0
21 Nov 2022
Previous
123456
Next