Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2207.07285
Cited By
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
15 July 2022
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
50 / 168 papers shown
Title
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
WonJun Moon
Cheol-Ho Cho
Woojin Jun
Minho Shim
Taeoh Kim
Inwoong Lee
Dongyoon Wee
Jae-Pil Heo
36
0
0
17 Apr 2025
Image Editing with Diffusion Models: A Survey
Jia Wang
Jie Hu
Xiaoqi Ma
Hanghang Ma
Xiaoming Wei
Enhua Wu
68
0
0
17 Apr 2025
RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
E. Peruzzo
Dejia Xu
Xingqian Xu
Humphrey Shi
N. Sebe
DiffM
VGen
56
0
0
09 Apr 2025
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
26
0
0
07 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
41
0
0
04 Apr 2025
Post-processing for Fair Regression via Explainable SVD
Zhiqun Zuo
Ding Zhu
Mohammad Mahdi Khalili
152
0
0
04 Apr 2025
MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
ViT
36
0
0
03 Apr 2025
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Boseung Jeong
Jicheol Park
Sungyeon Kim
Suha Kwak
36
0
0
03 Apr 2025
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
41
0
0
03 Apr 2025
Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions
Thinesh Thiyakesan Ponbagavathi
Alina Roitberg
39
0
0
31 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
92
0
0
26 Mar 2025
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Arun V. Reddy
Alexander Martin
Eugene Yang
Andrew Yates
Kate Sanders
Kenton W. Murray
Reno Kriz
Celso M. De Melo
Benjamin Van Durme
Rama Chellappa
50
1
0
24 Mar 2025
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Zengrong Lin
Zheng Wang
Tianwen Qian
Pan Mu
Sixian Chan
Cong Bai
52
0
0
13 Mar 2025
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Chan hur
Jeong-hun Hong
Dong-hun Lee
Dabin Kang
Semin Myeong
Sang-hyo Park
Hyeyoung Park
58
0
0
07 Mar 2025
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval
Haoran Tang
Meng Cao
Jinfa Huang
Ruyang Liu
Peng Jin
Ge Li
Xiaodan Liang
Mamba
96
4
0
24 Feb 2025
CrossOver: 3D Scene Cross-Modal Alignment
S. Sarkar
O. Mikšík
Marc Pollefeys
Daniel Barath
Iro Armeni
3DPC
78
0
0
20 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
71
0
0
09 Feb 2025
Can masking background and object reduce static bias for zero-shot action recognition?
Takumi Fukuzawa
Kensho Hara
Hirokatsu Kataoka
Toru Tamaki
43
0
0
22 Jan 2025
Soft Vision-Based Tactile-Enabled SixthFinger: Advancing Daily Objects Manipulation for Stroke Survivors
Basma B. Hasanen
Mashood M. Mohsan
Abdulaziz Alkayas
F. Renda
Irfan Hussain
36
0
0
12 Jan 2025
Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT
Wen-Dong Jiang
Chih-Yung Chang
Diptendu Sinha Roy
38
0
0
07 Jan 2025
Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
N. Dennler
S. Nikolaidis
Maja J. Matarić
138
0
0
03 Jan 2025
GFG -- Gender-Fair Generation: A CALAMITA Challenge
Simona Frenda
Andrea Piergentili
Beatrice Savoldi
Marco Madeddu
Martina Rosola
Silvia Casola
Chiara Ferrando
V. Patti
Matteo Negri
L. Bentivogli
37
1
0
31 Dec 2024
To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation
Jungkyu Kim
Kibok Lee
Taeyoung Park
44
0
0
26 Dec 2024
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
Y. Wang
Zhikang Zhang
Jue Wang
D. Fan
Zhenlin Xu
Linda Liu
Xiang Hao
Vimal Bhat
Xinyu Li
VLM
82
1
0
10 Dec 2024
CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives
Armin Saghafian
Amirmohammad Izadi
Negin Hashemi Dijujin
M. Baghshah
64
0
0
29 Nov 2024
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Hmrishav Bandyopadhyay
Yi-Zhe Song
DiffM
VGen
30
3
0
16 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
H. Haresamudram
Chi Ian Tang
Sungho Suh
P. Lukowicz
Thomas Ploetz
76
2
0
11 Nov 2024
A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model
Panwen Hu
Nan Xiao
Feifei Li
Yongquan Chen
Rui Huang
VGen
OffRL
50
3
0
07 Nov 2024
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Ruyang Liu
Haoran Tang
Haibo Liu
Yixiao Ge
Ying Shan
Chen Li
Jiankun Yang
VLM
48
5
0
04 Nov 2024
Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval
Zijun Min
Bingshuai Liu
L. Zhang
Jia Song
Jinsong Su
Song He
Xiaochen Bo
OT
65
1
0
04 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo
Yibing Song
25
0
0
30 Oct 2024
Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Nils Blank
Moritz Reuss
Marcel Rühle
Ömer Erdinç Yagmurlu
Fabian Wenzel
Oier Mees
Rudolf Lioutikov
LM&Ro
OffRL
29
4
0
23 Oct 2024
Are Visual-Language Models Effective in Action Recognition? A Comparative Study
Mahmoud Ali
Di Yang
François Brémond
VLM
51
0
0
22 Oct 2024
Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Yu Zhao
Hao Fei
Xiangtai Li
L. Qin
Jiayi Ji
Hongyuan Zhu
Meishan Zhang
M. Zhang
Jianguo Wei
DiffM
29
1
0
20 Oct 2024
Beyond Coarse-Grained Matching in Video-Text Retrieval
Aozhu Chen
Hazel Doughty
Xirong Li
Cees G. M. Snoek
32
0
0
16 Oct 2024
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
Reno Kriz
Kate Sanders
David Etter
Kenton W. Murray
Cameron Carpenter
...
Alexander Martin
Ronald Colaianni
Nolan King
Eugene Yang
Benjamin Van Durme
VGen
41
2
0
15 Oct 2024
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
26
2
0
12 Oct 2024
Exploring Foundation Models in Remote Sensing Image Change Detection: A Comprehensive Survey
Zihan Yu
Tianxiao Li
Yuxin Zhu
Rongze Pan
38
0
0
10 Oct 2024
Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval
Yabing Wang
Le Wang
Qiang-feng Zhou
Zhibin Wang
Hao Li
Gang Hua
Wei Tang
33
7
0
30 Sep 2024
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm
Bingqing Zhang
Zhuo Cao
Heming Du
Xin Yu
Xue Li
Jiajun Liu
Sen Wang
VGen
25
0
0
30 Sep 2024
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang
Xiaoye Qu
Tong Zhu
Yu Cheng
41
7
0
28 Sep 2024
Mamba Fusion: Learning Actions Through Questioning
Zhikang Dong
Apoorva Beedu
Jason Sheinkopf
Irfan Essa
Mamba
70
2
0
17 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
32
0
0
17 Sep 2024
A Novel Dataset for Video-Based Autism Classification Leveraging Extra-Stimulatory Behavior
Manuel Serna-Aguilera
Xuan-Bac Nguyen
Han-Seok Seo
Khoa Luu
41
1
0
06 Sep 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Yiwei Ma
Jiayi Ji
Ke Ye
Weihuang Lin
Zhibin Wang
Yonghan Zheng
Qiang-feng Zhou
Xiaoshuai Sun
Rongrong Ji
46
5
0
26 Aug 2024
Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition -- And Ways to Overcome Them
H. Haresamudram
Apoorva Beedu
Mashfiqui Rabbi
Sankalita Saha
Irfan Essa
Thomas Ploetz
31
4
0
21 Aug 2024
PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
Subash Khanal
Eric Xing
S. Sastry
A. Dhakal
Zhexiao Xiong
Adeel Ahmad
Nathan Jacobs
36
2
0
13 Aug 2024
Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization
Geuntaek Lim
Hyunwoo Kim
Joonsoo Kim
Yukyung Choi
31
0
0
12 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
45
0
0
05 Aug 2024
3D-GRES: Generalized 3D Referring Expression Segmentation
Changli Wu
Yihang Liu
Jiayi Ji
Yiwei Ma
Haowei Wang
Gen Luo
Henghui Ding
Xiaoshuai Sun
Rongrong Ji
36
7
0
30 Jul 2024
1
2
3
4
Next