ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.00650
  4. Cited By
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

1 April 2021
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
    VGen
ArXivPDFHTML

Papers citing "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval"

50 / 271 papers shown
Title
GazeFusion: Saliency-Guided Image Generation
GazeFusion: Saliency-Guided Image Generation
Yunxiang Zhang
Nan Wu
Connor Z. Lin
Gordon Wetzstein
Qi Sun
45
0
0
16 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
39
10
0
13 Mar 2024
Tuning-Free Noise Rectification for High Fidelity Image-to-Video
  Generation
Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation
Weijie Li
Litong Gong
Yiran Zhu
Fanda Fan
Biao Wang
Tiezheng Ge
Bo Zheng
VGen
DiffM
49
2
0
05 Mar 2024
TempCompass: Do Video LLMs Really Understand Videos?
TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu
Shicheng Li
Yi Liu
Yuxiang Wang
Shuhuai Ren
Lei Li
Sishuo Chen
Xu Sun
Lu Hou
VLM
43
101
0
01 Mar 2024
Unifying Latent and Lexicon Representations for Effective Video-Text
  Retrieval
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
36
0
0
26 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
45
29
0
20 Feb 2024
World Model on Million-Length Video And Language With Blockwise RingAttention
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu
Wilson Yan
Matei A. Zaharia
Pieter Abbeel
VGen
39
63
0
13 Feb 2024
Boximator: Generating Rich and Controllable Motions for Video Synthesis
Boximator: Generating Rich and Controllable Motions for Video Synthesis
Jiawei Wang
Yuchen Zhang
Jiaxin Zou
Yan Zeng
Guoqiang Wei
Liping Yuan
Hang Li
DiffM
VGen
35
43
0
02 Feb 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with
  Explicit Motion Modeling
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
Xiaoyu Shi
Zhaoyang Huang
Fu-Yun Wang
Weikang Bian
Dasong Li
...
Ka Chun Cheung
Simon See
Hongwei Qin
Jifeng Da
Hongsheng Li
VGen
DiffM
43
81
0
29 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
69
35
0
16 Jan 2024
Video Anomaly Detection and Explanation via Large Language Models
Video Anomaly Detection and Explanation via Large Language Models
Hui Lv
Qianru Sun
33
21
0
11 Jan 2024
SnapCap: Efficient Snapshot Compressive Video Captioning
SnapCap: Efficient Snapshot Compressive Video Captioning
Jianqiao Sun
Yudi Su
Hao Zhang
Ziheng Cheng
Zequn Zeng
Zhengjue Wang
Bo Chen
Xin Yuan
32
1
0
10 Jan 2024
Latte: Latent Diffusion Transformer for Video Generation
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma
Yaohui Wang
Gengyun Jia
Xinyuan Chen
Ziqiang Liu
Yuan-Fang Li
Cunjian Chen
Yu Qiao
DiffM
VGen
125
242
0
05 Jan 2024
A Strong Baseline for Temporal Video-Text Alignment
A Strong Baseline for Temporal Video-Text Alignment
Zeqian Li
Qirui Chen
Tengda Han
Ya Zhang
Yanfeng Wang
Weidi Xie
AI4TS
VGen
43
5
0
21 Dec 2023
Video Recognition in Portrait Mode
Video Recognition in Portrait Mode
Mingfei Han
Linjie Yang
Xiaojie Jin
Jiashi Feng
Xiaojun Chang
Heng Wang
30
3
0
21 Dec 2023
MaskINT: Video Editing via Interpolative Non-autoregressive Masked
  Transformers
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
Haoyu Ma
Shahin Mahdizadehaghdam
Bichen Wu
Zhipeng Fan
Yuchao Gu
Wenliang Zhao
Lior Shapira
Xiaohui Xie
DiffM
VGen
30
4
0
19 Dec 2023
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan
Md. Mohaiminul Islam
Thomas Seidl
Gedas Bertasius
28
3
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Zhiwu Qing
Shiwei Zhang
Jiayu Wang
Xiang Wang
Yujie Wei
Yingya Zhang
Changxin Gao
Nong Sang
VGen
DiffM
32
37
0
07 Dec 2023
MotionCtrl: A Unified and Flexible Motion Controller for Video
  Generation
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
Zhouxia Wang
Ziyang Yuan
Xintao Wang
Tianshui Chen
Menghan Xia
Ping Luo
Ying Shan
DiffM
VGen
50
198
0
06 Dec 2023
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion
  Models
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
Zhen Xing
Qi Dai
Zihao Zhang
Hui Zhang
Hang-Rui Hu
Zuxuan Wu
Yu-Gang Jiang
VGen
53
17
0
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
59
1
0
30 Nov 2023
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
Sherwin Bahmani
Ivan Skorokhodov
Victor Rong
Gordon Wetzstein
Leonidas J. Guibas
Peter Wonka
Sergey Tulyakov
Jeong Joon Park
Andrea Tagliasacchi
David B. Lindell
DiffM
54
103
0
29 Nov 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
43
264
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
84
410
0
28 Nov 2023
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via
  Blender-Oriented GPT Planning
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Jiaxi Lv
Yi Huang
Mingfu Yan
Jiancheng Huang
Jianzhuang Liu
Yifan Liu
Yafei Wen
Xiaoxin Chen
Shifeng Chen
VGen
DiffM
32
23
0
21 Nov 2023
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion
  Models
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang
Jiayu Wang
Yingya Zhang
Kang Zhao
Hangjie Yuan
Zhan Qin
Xiang Wang
Deli Zhao
Jingren Zhou
DiffM
VGen
49
200
0
07 Nov 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision)
MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Qinghong Lin
Faisal Ahmed
Linjie Li
Chung-Ching Lin
E. Azarnasab
...
Lin Liang
Zicheng Liu
Yumao Lu
Ce Liu
Lijuan Wang
MLLM
28
63
0
30 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language
  Understanding
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
51
28
0
29 Oct 2023
Matryoshka Diffusion Models
Matryoshka Diffusion Models
Jiatao Gu
Shuangfei Zhai
Yizhen Zhang
Joshua M. Susskind
Navdeep Jaitly
DiffM
21
43
0
23 Oct 2023
Large Models for Time Series and Spatio-Temporal Data: A Survey and
  Outlook
Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook
Ming Jin
Qingsong Wen
Keli Zhang
Chaoli Zhang
Siqiao Xue
...
Shirui Pan
Vincent S. Tseng
Yu Zheng
Lei Chen
Hui Xiong
AI4TS
SyDa
40
118
0
16 Oct 2023
Is ImageNet worth 1 video? Learning strong image encoders from 1 long
  unlabelled video
Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video
Shashanka Venkataramanan
Mamshad Nayeem Rizve
João Carreira
Yuki M. Asano
Yannis Avrithis
SSL
37
18
0
12 Oct 2023
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Rui Zhao
Yuchao Gu
Jay Zhangjie Wu
David Junhao Zhang
Jia-Wei Liu
Weijia Wu
Jussi Keppo
Mike Zheng Shou
DiffM
VGen
30
104
0
12 Oct 2023
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
32
36
0
10 Oct 2023
Latent Wander: an Alternative Interface for Interactive and
  Serendipitous Discovery of Large AV Archives
Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives
Yuchen Yang
Linyida Zhang
21
2
0
09 Oct 2023
Training a Large Video Model on a Single Machine in a Day
Training a Large Video Model on a Single Machine in a Day
Yue Zhao
Philipp Krahenbuhl
VLM
34
15
0
28 Sep 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLM
VLM
32
2
0
27 Sep 2023
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM
  Animator
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
Hanzhuo Huang
Yufan Feng
Cheng Shi
Lan Xu
Jingyi Yu
Sibei Yang
DiffM
VGen
23
63
0
25 Sep 2023
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided
  Video DecodER
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Mingzhen Sun
Weining Wang
Zihan Qin
Jiahui Sun
Si-Qing Chen
Qingbin Liu
DiffM
37
3
0
23 Sep 2023
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial
  Margin Contrastive Learning
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
Chen Jiang
Hong Liu
Xuzheng Yu
Qing Wang
Yuan Cheng
...
Zhongyi Liu
Qingpei Guo
Wei Chu
Ming Yang
Yuan Qi
29
10
0
20 Sep 2023
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Jiaxi Gu
Shicong Wang
Haoyu Zhao
Tianyi Lu
Xing Zhang
Zuxuan Wu
Songcen Xu
Wei Zhang
Yu-Gang Jiang
Hang Xu
DiffM
VGen
39
44
0
07 Sep 2023
Representation Learning for Sequential Volumetric Design Tasks
Representation Learning for Sequential Volumetric Design Tasks
Md Ferdous Alam
Yi Wang
Linh Tran
Chin-Yi Cheng
Jieliang Luo
3DV
27
2
0
05 Sep 2023
VideoGen: A Reference-Guided Latent Diffusion Approach for High
  Definition Text-to-Video Generation
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Xin Li
Wenqing Chu
Ye Wu
Weihang Yuan
Fanglong Liu
Qi Zhang
Fu Li
Haocheng Feng
Errui Ding
Jingdong Wang
VGen
45
52
0
01 Sep 2023
AI-Generated Content (AIGC) for Various Data Modalities: A Survey
AI-Generated Content (AIGC) for Various Data Modalities: A Survey
Lin Geng Foo
Hossein Rahmani
Jun Liu
78
31
0
27 Aug 2023
MobileVidFactory: Automatic Diffusion-Based Social Media Video
  Generation for Mobile Devices from Text
MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text
Junchen Zhu
Huan Yang
Wenjing Wang
Huiguo He
Zixi Tuo
...
Wen-Huang Cheng
Lianli Gao
Jingkuan Song
Jianlong Fu
Jiebo Luo
DiffM
30
6
0
31 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
Fahad Shahbaz Khan
VLM
38
118
0
25 Jul 2023
TVPR: Text-to-Video Person Retrieval and a New Benchmark
TVPR: Text-to-Video Person Retrieval and a New Benchmark
Fan Ni
Xu Zhang
Jianhui Wu
Guan-Nan Dong
Aichun Zhu
Hui Liu
Yue Zhang
48
0
0
14 Jul 2023
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
Kate Sanders
David Etter
Reno Kriz
Benjamin Van Durme
VGen
42
7
0
06 Jul 2023
Valley: Video Assistant with Large Language model Enhanced abilitY
Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo
Ziwang Zhao
Min Yang
Junwei Dong
Da Li
Pengcheng Lu
Tao Wang
Linmei Hu
Ming-Hui Qiu
MLLM
54
191
0
12 Jun 2023
VideoComposer: Compositional Video Synthesis with Motion Controllability
VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang
Hangjie Yuan
Shiwei Zhang
Dayou Chen
Jiuniu Wang
Yingya Zhang
Yujun Shen
Deli Zhao
Jingren Zhou
VGen
DiffM
33
318
0
03 Jun 2023
Previous
123456
Next