Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2310.01852
Cited By
v1
v2
v3
v4
v5
v6
v7 (latest)
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
3 October 2023
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
HongFa Wang
Yatian Pang
Wenhao Jiang
Junwu Zhang
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (810★)
Papers citing
"LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment"
50 / 55 papers shown
Title
Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Junlin Li
Guodong DU
Jing Li
Sim Kuan Goh
Wenya Wang
...
Fangming Liu
Jing Li
Saleh Alharbi
Daojing He
Min Zhang
MoMe
CLL
118
1
0
21 May 2025
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Thong Nguyen
Zhiyuan Hu
Xu Lin
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
76
0
0
19 May 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
87
0
0
02 May 2025
DMPT: Decoupled Modality-aware Prompt Tuning for Multi-modal Object Re-identification
Minghui Lin
Shu Wang
Xiang Wang
Jianhua Tang
Longbin Fu
Zhengrong Zuo
Nong Sang
VLM
203
0
0
15 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
Kai Zhang
Jinahua Han
Lanqing Hong
Hang Xu
Xuelong Li
MLLM
VLM
473
0
0
08 Apr 2025
MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
Saron Samuel
Dan DeGenaro
Jimena Guallar-Blasco
Kate Sanders
Oluwaseun Eisape
...
David Etter
Efsun Kayi
Matthew Wiesner
Kenton W. Murray
Reno Kriz
140
0
0
26 Mar 2025
Continual Multimodal Contrastive Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
CLL
206
2
0
19 Mar 2025
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Jonas Belouadi
Eddy Ilg
Margret Keuper
Hideki Tanaka
Masao Utiyama
Raj Dabre
Steffen Eger
Simone Paolo Ponzetto
139
0
0
14 Mar 2025
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education
Yanhao Jia
Xinyi Wu
Hao Li
Qinglin Zhang
Yuxiao Hu
Shuai Zhao
Wenqi Fan
152
4
0
09 Feb 2025
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin
Emily Ruoyu Liu
Joyce Chuyi Chen
Ines Dormoy
Jinyoung Kim
Samar Khanna
Zhuo Zheng
Stefano Ermon
MLLM
VLM
140
9
0
28 Jan 2025
The "Law" of the Unconscious Contrastive Learner: Probabilistic Alignment of Unpaired Modalities
Yongwei Che
Benjamin Eysenbach
66
1
0
20 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
151
2
0
10 Jan 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
128
2
0
31 Dec 2024
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
Alex Schwing
Yuki Mitsufuji
VGen
261
17
0
19 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
272
1
0
18 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
165
3
0
16 Dec 2024
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
SungHeon Jeong
Hanning Chen
Sanggeon Yun
Suhyeon Cho
Wenjun Huang
Xiangjian Liu
Mohsen Imani
144
2
0
04 Dec 2024
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Shufan Li
Konstantinos Kallidromitis
Akash Gokul
Zichun Liao
Yusuke Kato
Kazuki Kozuka
Aditya Grover
VGen
142
7
0
02 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
193
2
0
01 Dec 2024
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Zongjian Li
Bin Lin
Yang Ye
Liuhan Chen
Xinhua Cheng
Shenghai Yuan
Li-xin Yuan
VGen
DiffM
151
19
0
26 Nov 2024
ReWind: Understanding Long Videos with Instructed Learnable Memory
Anxhelo Diko
Tinghuai Wang
Wassim Swaileh
Shiyan Sun
Ioannis Patras
KELM
VLM
124
1
0
23 Nov 2024
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
Reno Kriz
Kate Sanders
David Etter
Kenton W. Murray
Cameron Carpenter
...
Alexander Martin
Ronald Colaianni
Nolan King
Eugene Yang
Benjamin Van Durme
VGen
145
3
0
15 Oct 2024
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Javier Marin
LRM
126
0
0
06 Oct 2024
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli
Andrey Barsky
Mohamed Ali Souibgui
Artemis LLabres
Marco Bertini
Dimosthenis Karatzas
92
4
0
14 Sep 2024
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Chunyu Qiang
Wang Geng
Yi Zhao
Ruibo Fu
Tao Wang
...
Chen Zhang
Hao Che
L. Wang
Jianwu Dang
J. Tao
AI4TS
88
0
0
11 Aug 2024
Video-to-Audio Generation with Hidden Alignment
Manjie Xu
Chenxing Li
Yong Ren
Rilin Chen
Yu Gu
Yu Gu
Dong Yu
Dong Yu
DiffM
VGen
78
12
0
10 Jul 2024
RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios
Liming Zheng
Feng Yan
Fanfan Liu
Chengjian Feng
Zhuoliang Kang
Lin Ma
115
2
0
09 Jul 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
287
52
0
23 May 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
145
35
0
07 Apr 2024
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
143
105
0
29 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
75
9
0
23 May 2023
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
286
950
0
27 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
79
110
0
17 Apr 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
100
166
0
28 Mar 2023
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
126
331
0
06 Dec 2022
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
127
531
0
12 Nov 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
86
153
0
15 Sep 2022
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
64
85
0
01 Apr 2022
Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth
Doyeon Kim
Woonghyun Ka
Pyunghwan Ahn
Donggyu Joo
S. Chun
Junmo Kim
MDE
45
126
0
19 Jan 2022
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
70
194
0
19 Nov 2021
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
467
7,757
0
11 Nov 2021
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
399
1,107
0
13 Oct 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
477
10,367
0
17 Jun 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
434
1,127
0
17 Feb 2021
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
127
422
0
14 Nov 2020
Rethinking CNN Models for Audio Classification
Kamalesh Palanisamy
Dipika Singhania
Angela Yao
SSL
70
145
0
22 Jul 2020
VGGSound: A Large-scale Audio-Visual Dataset
Honglie Chen
Weidi Xie
Andrea Vedaldi
Andrew Zisserman
89
582
0
29 Apr 2020
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
115
1,200
0
07 Jun 2019
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai
Zhilin Yang
Yiming Yang
J. Carbonell
Quoc V. Le
Ruslan Salakhutdinov
VLM
250
3,730
0
09 Jan 2019
Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
Gunnar Sigurdsson
Abhinav Gupta
Cordelia Schmid
Ali Farhadi
Alahari Karteek
SLR
EgoV
63
170
0
25 Apr 2018
1
2
Next