arXiv: 2207.07285
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
15 July 2022
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
Papers citing "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
50 / 168 papers shown
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos
Dhruv Verma
Debaditya Roy
Basura Fernando
27
1
0
30 Jul 2024
SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval
Longtao Jiang
Min Wang
Zecheng Li
Yao Fang
Wen-gang Zhou
Houqiang Li
SLR
34
2
0
23 Jul 2024
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu
Yiyi Chen
Yan Li
Minquan Wang
Haoqian Wang
Quan Chen
Han Li
Peng Jiang
AI4TS
29
0
0
23 Jul 2024
Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models
Thinesh Thiyakesan Ponbagavathi
Kunyu Peng
Alina Roitberg
40
1
0
22 Jul 2024
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo
Pedro Morgado
33
5
0
18 Jul 2024
Multi-branch Collaborative Learning Network for 3D Visual Grounding
Zhipeng Qian
Yiwei Ma
Zhekai Lin
Jiayi Ji
Xiawu Zheng
Xiaoshuai Sun
Rongrong Ji
3DV
41
4
0
07 Jul 2024
Semantically Guided Representation Learning For Action Anticipation
Anxhelo Diko
D. Avola
Bardh Prenkaj
Federico Fontana
Luigi Cinque
AI4TS
43
2
0
02 Jul 2024
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He
Dongfu Jiang
Ge Zhang
Max W.F. Ku
Achint Soni
...
Yaswanth Narsupalli
Rongqi Fan
Zhiheng Lyu
Yuchen Lin
Wenhu Chen
EGVM
VGen
ALM
48
42
0
21 Jun 2024
CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification
Shuang Li
Jiaxu Leng
Guozhang Li
Ji Gan
Haosheng Chen
Xinbo Gao
60
1
0
13 Jun 2024
Image Captioning via Dynamic Path Customization
Yiwei Ma
Jiayi Ji
Xiaoshuai Sun
Yiyi Zhou
Xiaopeng Hong
Yongjian Wu
Rongrong Ji
34
0
0
01 Jun 2024
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao
Haoran Tang
Jinfa Huang
Peng Jin
Can Zhang
Ruyang Liu
Long Chen
Xiaodan Liang
Li-ming Yuan
Ge Li
101
11
0
29 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
68
5
0
26 May 2024
An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
CLIP
39
1
0
25 May 2024
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Haonan Zhang
Pengpeng Zeng
Lianli Gao
Jingkuan Song
Yihang Duan
Xinyu Lyu
Hengtao Shen
VLM
CLIP
40
2
0
21 May 2024
Open-Vocabulary Spatio-Temporal Action Detection
Tao Wu
Shuqiu Ge
Jie Qin
Gangshan Wu
Limin Wang
ObjD
28
5
0
17 May 2024
GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting
Haodong Chen
Yongle Huang
Haojian Huang
Xiangsheng Ge
Dian Shao
DiffM
42
11
0
13 May 2024
X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation
Yiwei Ma
Zhekai Lin
Jiayi Ji
Yijun Fan
Xiaoshuai Sun
Rongrong Ji
34
7
0
02 May 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
44
3
0
26 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
45
1
0
22 Apr 2024
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior
Zichen Liu
Yihao Meng
Ouyang Hao
Yue Yu
Bolin Zhao
Daniel Cohen-Or
Huamin Qu
DiffM
29
5
0
17 Apr 2024
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
Simon Schrodi
David T. Hoffmann
Max Argus
Volker Fischer
Thomas Brox
VLM
58
0
0
11 Apr 2024
DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning
Mengfei Du
Binhao Wu
Jiwen Zhang
Zhihao Fan
Zejun Li
Ruipu Luo
Xuanjing Huang
Zhongyu Wei
33
3
0
02 Apr 2024
A Survey on Large Language Models from Concept to Implementation
Chen Wang
Jin Zhao
Jiaqi Gong
LLMAG
LM&MA
37
3
0
27 Mar 2024
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
Omkar Thawakar
Muzammal Naseer
Rao Muhammad Anwer
Salman Khan
M. Felsberg
Mubarak Shah
Fahad Shahbaz Khan
32
7
0
25 Mar 2024
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
Joonmyung Choi
Sanghyeok Lee
Jaewon Chu
Minhyuk Choi
Hyunwoo J. Kim
MoMe
ViT
55
12
0
20 Mar 2024
Text-to-Audio Generation Synchronized with Videos
Shentong Mo
Jing Shi
Yapeng Tian
DiffM
VGen
37
17
0
08 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
86
178
0
29 Feb 2024
Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
Yugo Kubota
Daichi Haraguchi
Seiichi Uchida
CLIP
VLM
43
1
0
26 Feb 2024
Real-World Robot Applications of Foundation Models: A Review
Kento Kawaharazuka
T. Matsushima
Andrew Gambardella
Jiaxian Guo
Chris Paxton
Andy Zeng
OffRL
VLM
LM&Ro
48
45
0
08 Feb 2024
Visual Objectification in Films: Towards a New AI Task for Video Interpretation
Julie Tores
L. Sassatelli
Hui-Yin Wu
Clement Bergman
Lea Andolfi
...
F. Precioso
Thierry Devars
Magali Guaresi
Virginie Julliard
Sarah Lecossais
38
2
0
24 Jan 2024
On the Efficacy of Text-Based Input Modalities for Action Anticipation
Apoorva Beedu
Karan Samel
Irfan Essa
53
2
0
23 Jan 2024
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
34
23
0
19 Jan 2024
Cross-Modality Perturbation Synergy Attack for Person Re-identification
Yunpeng Gong
Zhun Zhong
Zhiming Luo
Yansong Qu
Rongrong Ji
Min Jiang
AAML
35
20
0
18 Jan 2024
FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
S. DarshanSingh
Zeeshan Khan
Makarand Tapaswi
VLM
CLIP
36
3
0
15 Jan 2024
Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos
Rongqin Liang
Yuanman Li
Jiantao Zhou
Xia Li
38
6
0
07 Jan 2024
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
Kaibin Tian
Yanhua Cheng
Yi Liu
Xinglin Hou
Quan Chen
Han Li
27
3
0
01 Jan 2024
iKUN: Speak to Trackers without Retraining
Yunhao Du
Cheng Lei
Zhicheng Zhao
Fei Su
VOT
32
12
0
25 Dec 2023
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
Sihan Liu
Yiwei Ma
Xiaoqing Zhang
Haowei Wang
Jiayi Ji
Xiaoshuai Sun
Rongrong Ji
24
38
0
19 Dec 2023
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan
Md. Mohaiminul Islam
Thomas Seidl
Gedas Bertasius
28
3
0
11 Dec 2023
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Rahul Pratap Singh
Bishmoy Paul
Ali Dabouei
Min Xu
22
1
0
10 Dec 2023
TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
Xuying Zhang
Bo-Wen Yin
Yuming Chen
Zheng Lin
Yunheng Li
Qibin Hou
Ming-Ming Cheng
CLIP
DiffM
34
7
0
07 Dec 2023
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Arun V. Reddy
William Paul
Corban Rivera
Ketul Shah
Celso M. de Melo
Rama Chellappa
37
4
0
05 Dec 2023
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
Yiwei Ma
Yijun Fan
Jiayi Ji
Haowei Wang
Xiaoshuai Sun
Guannan Jiang
Annan Shu
Rongrong Ji
16
7
0
30 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
36
29
0
29 Nov 2023
SPOT! Revisiting Video-Language Models for Event Understanding
Gengyuan Zhang
Jinhe Bi
Jindong Gu
Yanyu Chen
Volker Tresp
27
2
0
21 Nov 2023
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal
Yonatan Bitton
Idan Szpektor
Kai-Wei Chang
Aditya Grover
40
14
0
15 Nov 2023
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Konstantin Yakovlev
Gregory Polyakov
I. Alimova
Alexander Podolskiy
A. Bout
Sergey I. Nikolenko
Irina Piontkovskaya
CLIP
16
1
0
14 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
38
15
0
13 Nov 2023
REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning
Maxwell A. Xu
Alexander Moreno
Hui Wei
Benjamin M. Marlin
James M. Rehg
AI4TS
SSL
34
11
0
01 Nov 2023
An Empirical Study of Frame Selection for Text-to-Video Retrieval
Mengxia Wu
Min Cao
Yang Bai
Ziyin Zeng
Chen Chen
Liqiang Nie
Min Zhang
34
3
0
01 Nov 2023