2305.05665
Cited By
ImageBind: One Embedding Space To Bind Them All
9 May 2023
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
Papers citing "ImageBind: One Embedding Space To Bind Them All"
50 / 172 papers shown
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
22
0
0
16 May 2025
DSADF: Thinking Fast and Slow for Decision Making
Alex Zhihao Dou
Dongfei Cui
Jun Yan
Wei Wang
Benteng Chen
Haoming Wang
Zeke Xie
Shufei Zhang
OffRL
43
0
0
13 May 2025
ALFEE: Adaptive Large Foundation Model for EEG Representation
Wei Xiong
Junming Lin
Jiangtong Li
Jie Li
Changjun Jiang
39
0
0
07 May 2025
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
63
0
0
05 May 2025
TxP: Reciprocal Generation of Ground Pressure Dynamics and Activity Descriptions for Improving Human Activity Recognition
L. Ray
Lars Krupp
Vitor Fortes Rey
Bo Zhou
Sungho Suh
Paul Lukowicz
AI4CE
156
0
0
04 May 2025
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
Gabriel Sarch
Balasaravanan Thoravi Kumaravel
Sahithya Ravi
Vibhav Vineet
A. D. Wilson
206
0
0
02 May 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
44
0
0
02 May 2025
X-Fusion: Introducing New Modality to Frozen Large Language Models
Sicheng Mo
Thao Nguyen
Xun Huang
Siddharth Srinivasan Iyer
Yijun Li
...
Eli Shechtman
Krishna Kumar Singh
Yong Jae Lee
Bolei Zhou
Yuheng Li
77
0
0
29 Apr 2025
DEEMO: De-identity Multimodal Emotion Recognition and Reasoning
Deng Li
Bohao Xing
Xin Liu
Baiqiang Xia
Bihan Wen
Heikki Kälviäinen
VLM
68
0
0
28 Apr 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video
Huadai Liu
Tianyi Luo
Qikai Jiang
Kaicheng Luo
Peiwen Sun
...
Xin Li
Shiliang Zhang
Zhijie Yan
Zhou Zhao
Wei Xue
VGen
58
0
0
21 Apr 2025
DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification
Minghui Lin
Shu Wang
Xiang Wang
Jianhua Tang
Longbin Fu
Zhengrong Zuo
Nong Sang
VLM
47
0
0
15 Apr 2025
FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding
Yasar Abbas Ur Rehman
Kin Wai Lau
Yuyang Xie
Ma Lan
Jiajun Shen
34
0
0
13 Apr 2025
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Team Seawead
Ceyuan Yang
Zhijie Lin
Yang Zhao
Shanchuan Lin
...
Zuquan Song
Zhenheng Yang
Jiashi Feng
Jianchao Yang
Lu Jiang
DiffM
96
2
0
11 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Shuyu Li
Shulei Ji
Zihao Wang
Songruoyao Wu
Jiaxing Yu
Kaipeng Zhang
MGen
VGen
73
1
0
01 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
Jun Liu
Peijie Wang
Jing Tao
Zhou Su
63
1
0
01 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
54
0
0
29 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
Han Wang
Kai Hu
Liangcai Gao
179
0
0
20 Mar 2025
Continual Multimodal Contrastive Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
CLL
57
0
0
19 Mar 2025
Leveraging Perfect Multimodal Alignment and Gaussian Assumptions for Cross-modal Transfer
Abhi Kamboj
Minh Do
68
0
0
19 Mar 2025
Advancing Medical Representation Learning Through High-Quality Data
Negin Baghbanzadeh
Adibvafa Fallahpour
Yasaman Parhizkar
Franklin Ogidi
Shuvendu Roy
...
Vahid Reza Khazaie
Michael Colacci
Ali Etemad
Arash Afkanpour
Elham Dolatabadi
LM&MA
88
0
0
18 Mar 2025
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Jonas Belouadi
Eddy Ilg
M. Keuper
Hideki Tanaka
Masao Utiyama
Raj Dabre
Steffen Eger
Simone Paolo Ponzetto
52
0
0
14 Mar 2025
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
Yu Guo
67
3
0
13 Mar 2025
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
Yizheng Sun
Hao Li
Chang Xu
Hongpeng Zhou
Chenghua Lin
R. Batista-Navarro
Jingyuan Sun
62
0
0
09 Mar 2025
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM
Sunghyun Ahn
Youngwan Jo
Kijung Lee
Sein Kwon
Inpyo Hong
Sanghyun Park
63
0
0
06 Mar 2025
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Aayush Dhakal
Srikumar Sastry
Subash Khanal
Adeel Ahmad
Eric Xing
Nathan Jacobs
55
0
0
27 Feb 2025
Knowledge Bridger: Towards Training-free Missing Multi-modality Completion
Guanzhou Ke
Shengfeng He
Xueliang Wang
Bo Wang
Guoqing Chao
Yujie Zhang
Yi Xie
HeXing Su
68
0
0
27 Feb 2025
CrossOver: 3D Scene Cross-Modal Alignment
S. Sarkar
O. Mikšík
Marc Pollefeys
Daniel Barath
Iro Armeni
3DPC
78
0
0
20 Feb 2025
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
64
0
0
20 Feb 2025
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
Ruoxuan Feng
Jiangyu Hu
Wenke Xia
Tianci Gao
Ao Shen
Yuhao Sun
Bin Fang
Di Hu
47
5
0
15 Feb 2025
Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces
Cameron Armijo
Pablo Rivas
44
0
0
09 Feb 2025
The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities
Yongwei Che
Benjamin Eysenbach
41
1
0
20 Jan 2025
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection
Yuanze Li
Haolin Wang
Shihao Yuan
Ming-Yu Liu
Debin Zhao
Yiwen Guo
Chen Xu
Guangming Shi
Wangmeng Zuo
89
30
0
20 Jan 2025
TextToucher: Fine-Grained Text-to-Touch Generation
Jiahang Tu
Hao Fu
Fengyu Yang
Hanbin Zhao
Chao Zhang
Hui Qian
VLM
DiffM
83
9
0
10 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
106
111
0
10 Jan 2025
Detection, Retrieval, and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT
Wen-Dong Jiang
Chih-Yung Chang
Diptendu Sinha Roy
40
0
0
07 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
Hong Li
43
0
0
03 Jan 2025
Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection
Fenfang Tao
G. Xie
Fang Zhao
Xiangbo Shu
44
2
0
23 Dec 2024
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
A. Schwing
Yuki Mitsufuji
VGen
126
12
0
19 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
184
0
0
18 Dec 2024
Adversarial Hubness in Multi-Modal Retrieval
Tingwei Zhang
Fnu Suya
Rishi Jha
Collin Zhang
Vitaly Shmatikov
AAML
87
1
0
18 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
94
1
0
16 Dec 2024
Wearable Accelerometer Foundation Models for Health via Knowledge Distillation
Salar Abbaspourazad
Anshuman Mishra
Joseph D. Futoma
Andrew C. Miller
Ian Shapiro
95
0
0
15 Dec 2024
Mojito: Motion Trajectory and Intensity Control for Video Generation
Xuehai He
Shuohang Wang
Jianwei Yang
Xiaoxia Wu
Yansen Wang
Kuan-Chieh Jackson Wang
Z. Zhan
Olatunji Ruwase
Yelong Shen
Qing Guo
VGen
86
1
0
12 Dec 2024
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
SungHeon Jeong
Hanning Chen
Sanggeon Yun
Suhyeon Cho
Wenjun Huang
Xiangjian Liu
Mohsen Imani
100
1
0
04 Dec 2024
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
Zhi-Yi Chin
Kuan-Chen Mu
Mario Fritz
Pin-Yu Chen
DiffM
90
0
0
25 Nov 2024
Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
Wei Guo
Heng Wang
Jianbo Ma
Weidong Cai
DiffM
93
3
0
23 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
45
0
0
18 Nov 2024
Towards Open-Vocabulary Audio-Visual Event Localization
Jinxing Zhou
Dan Guo
Ruohao Guo
Yuxin Mao
Jingjing Hu
Yiran Zhong
Xiaojun Chang
Ming Wang
VLM
58
4
0
18 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
69
2
0
14 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
H. Haresamudram
Chi Ian Tang
Sungho Suh
P. Lukowicz
Thomas Ploetz
76
2
0
11 Nov 2024