ResearchTrend.AI
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

15 July 2022
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
    CLIP
    VLM

Papers citing "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"

50 / 168 papers shown
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
WonJun Moon
Cheol-Ho Cho
Woojin Jun
Minho Shim
Taeoh Kim
Inwoong Lee
Dongyoon Wee
Jae-Pil Heo
36
0
0
17 Apr 2025
Image Editing with Diffusion Models: A Survey
Jia Wang
Jie Hu
Xiaoqi Ma
Hanghang Ma
Xiaoming Wei
Enhua Wu
68
0
0
17 Apr 2025
RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism
E. Peruzzo
Dejia Xu
Xingqian Xu
Humphrey Shi
N. Sebe
DiffM
VGen
56
0
0
09 Apr 2025
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Xiaolun Jing
Genke Yang
Jian Chu
26
0
0
07 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
41
0
0
04 Apr 2025
Post-processing for Fair Regression via Explainable SVD
Zhiqun Zuo
Ding Zhu
Mohammad Mahdi Khalili
152
0
0
04 Apr 2025
MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
ViT
36
0
0
03 Apr 2025
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Boseung Jeong
Jicheol Park
Sungyeon Kim
Suha Kwak
36
0
0
03 Apr 2025
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
41
0
0
03 Apr 2025
Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions
Thinesh Thiyakesan Ponbagavathi
Alina Roitberg
39
0
0
31 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
92
0
0
26 Mar 2025
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Arun V. Reddy
Alexander Martin
Eugene Yang
Andrew Yates
Kate Sanders
Kenton W. Murray
Reno Kriz
Celso M. De Melo
Benjamin Van Durme
Rama Chellappa
50
1
0
24 Mar 2025
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Zengrong Lin
Zheng Wang
Tianwen Qian
Pan Mu
Sixian Chan
Cong Bai
52
0
0
13 Mar 2025
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Chan hur
Jeong-hun Hong
Dong-hun Lee
Dabin Kang
Semin Myeong
Sang-hyo Park
Hyeyoung Park
58
0
0
07 Mar 2025
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval
Haoran Tang
Meng Cao
Jinfa Huang
Ruyang Liu
Peng Jin
Ge Li
Xiaodan Liang
Mamba
96
4
0
24 Feb 2025
CrossOver: 3D Scene Cross-Modal Alignment
S. Sarkar
O. Mikšík
Marc Pollefeys
Daniel Barath
Iro Armeni
3DPC
78
0
0
20 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
71
0
0
09 Feb 2025
Can masking background and object reduce static bias for zero-shot action recognition?
Takumi Fukuzawa
Kensho Hara
Hirokatsu Kataoka
Toru Tamaki
43
0
0
22 Jan 2025
Soft Vision-Based Tactile-Enabled SixthFinger: Advancing Daily Objects Manipulation for Stroke Survivors
Basma B. Hasanen
Mashood M. Mohsan
Abdulaziz Alkayas
F. Renda
Irfan Hussain
36
0
0
12 Jan 2025
Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT
Wen-Dong Jiang
Chih-Yung Chang
Diptendu Sinha Roy
38
0
0
07 Jan 2025
Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
N. Dennler
S. Nikolaidis
Maja J. Matarić
138
0
0
03 Jan 2025
GFG -- Gender-Fair Generation: A CALAMITA Challenge
Simona Frenda
Andrea Piergentili
Beatrice Savoldi
Marco Madeddu
Martina Rosola
Silvia Casola
Chiara Ferrando
V. Patti
Matteo Negri
L. Bentivogli
37
1
0
31 Dec 2024
To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation
Jungkyu Kim
Kibok Lee
Taeyoung Park
44
0
0
26 Dec 2024
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
Y. Wang
Zhikang Zhang
Jue Wang
D. Fan
Zhenlin Xu
Linda Liu
Xiang Hao
Vimal Bhat
Xinyu Li
VLM
82
1
0
10 Dec 2024
CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives
Armin Saghafian
Amirmohammad Izadi
Negin Hashemi Dijujin
M. Baghshah
64
0
0
29 Nov 2024
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Hmrishav Bandyopadhyay
Yi-Zhe Song
DiffM
VGen
30
3
0
16 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
H. Haresamudram
Chi Ian Tang
Sungho Suh
P. Lukowicz
Thomas Ploetz
76
2
0
11 Nov 2024
A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model
Panwen Hu
Nan Xiao
Feifei Li
Yongquan Chen
Rui Huang
VGen
OffRL
50
3
0
07 Nov 2024
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Ruyang Liu
Haoran Tang
Haibo Liu
Yixiao Ge
Ying Shan
Chen Li
Jiankun Yang
VLM
48
5
0
04 Nov 2024
Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval
Zijun Min
Bingshuai Liu
L. Zhang
Jia Song
Jinsong Su
Song He
Xiaochen Bo
OT
65
1
0
04 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo
Yibing Song
25
0
0
30 Oct 2024
Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Nils Blank
Moritz Reuss
Marcel Rühle
Ömer Erdinç Yagmurlu
Fabian Wenzel
Oier Mees
Rudolf Lioutikov
LM&Ro
OffRL
29
4
0
23 Oct 2024
Are Visual-Language Models Effective in Action Recognition? A Comparative Study
Mahmoud Ali
Di Yang
François Brémond
VLM
51
0
0
22 Oct 2024
Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Yu Zhao
Hao Fei
Xiangtai Li
L. Qin
Jiayi Ji
Hongyuan Zhu
Meishan Zhang
M. Zhang
Jianguo Wei
DiffM
29
1
0
20 Oct 2024
Beyond Coarse-Grained Matching in Video-Text Retrieval
Aozhu Chen
Hazel Doughty
Xirong Li
Cees G. M. Snoek
32
0
0
16 Oct 2024
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
Reno Kriz
Kate Sanders
David Etter
Kenton W. Murray
Cameron Carpenter
...
Alexander Martin
Ronald Colaianni
Nolan King
Eugene Yang
Benjamin Van Durme
VGen
41
2
0
15 Oct 2024
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
26
2
0
12 Oct 2024
Exploring Foundation Models in Remote Sensing Image Change Detection: A Comprehensive Survey
Zihan Yu
Tianxiao Li
Yuxin Zhu
Rongze Pan
38
0
0
10 Oct 2024
Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval
Yabing Wang
Le Wang
Qiang-feng Zhou
Zhibin Wang
Hao Li
Gang Hua
Wei Tang
33
7
0
30 Sep 2024
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm
Bingqing Zhang
Zhuo Cao
Heming Du
Xin Yu
Xue Li
Jiajun Liu
Sen Wang
VGen
25
0
0
30 Sep 2024
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang
Xiaoye Qu
Tong Zhu
Yu Cheng
41
7
0
28 Sep 2024
Mamba Fusion: Learning Actions Through Questioning
Zhikang Dong
Apoorva Beedu
Jason Sheinkopf
Irfan Essa
Mamba
70
2
0
17 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
32
0
0
17 Sep 2024
A Novel Dataset for Video-Based Autism Classification Leveraging Extra-Stimulatory Behavior
Manuel Serna-Aguilera
Xuan-Bac Nguyen
Han-Seok Seo
Khoa Luu
41
1
0
06 Sep 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Yiwei Ma
Jiayi Ji
Ke Ye
Weihuang Lin
Zhibin Wang
Yonghan Zheng
Qiang-feng Zhou
Xiaoshuai Sun
Rongrong Ji
46
5
0
26 Aug 2024
Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition -- And Ways to Overcome Them
H. Haresamudram
Apoorva Beedu
Mashfiqui Rabbi
Sankalita Saha
Irfan Essa
Thomas Ploetz
31
4
0
21 Aug 2024
PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
Subash Khanal
Eric Xing
S. Sastry
A. Dhakal
Zhexiao Xiong
Adeel Ahmad
Nathan Jacobs
36
2
0
13 Aug 2024
Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization
Geuntaek Lim
Hyunwoo Kim
Joonsoo Kim
Yukyung Choi
31
0
0
12 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
45
0
0
05 Aug 2024
3D-GRES: Generalized 3D Referring Expression Segmentation
Changli Wu
Yihang Liu
Jiayi Ji
Yiwei Ma
Haowei Wang
Gen Luo
Henghui Ding
Xiaoshuai Sun
Rongrong Ji
36
7
0
30 Jul 2024