Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2107.07651
Cited By
v1
v2 (latest)
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
Is CLIP the main roadblock for fine-grained open-world perception?
Lorenzo Bianchi
F. Carrara
Nicola Messina
Fabrizio Falchi
VLM
81
4
0
04 Apr 2024
Cross-Modality Gait Recognition: Bridging LiDAR and Camera Modalities for Human Identification
Rui Wang
Chuanfu Shen
M. Marín-Jiménez
George Q. Huang
Shiqi Yu
CVBM
100
6
0
04 Apr 2024
3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization
Seung-bum Chung
Joohyun Park
Hyewon Kan
Hyeongyeop Kang
CLIP
77
1
0
03 Apr 2024
DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning
Mengfei Du
Binhao Wu
Jiwen Zhang
Zhihao Fan
Zejun Li
Ruipu Luo
Xuanjing Huang
Zhongyu Wei
69
3
0
02 Apr 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
50
5
0
01 Apr 2024
A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys)
Yashar Deldjoo
Zhankui He
Julian McAuley
Anton Korikov
Scott Sanner
Arnau Ramisa
René Vidal
M. Sathiamoorthy
Atoosa Kasirzadeh
Silvia Milano
VLM
152
60
0
31 Mar 2024
MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models
Zebang Cheng
Fuqiang Niu
Yuxiang Lin
Zhi-Qi Cheng
Bowen Zhang
Xiaojiang Peng
85
7
0
31 Mar 2024
Do Vision-Language Models Understand Compound Nouns?
Sonal Kumar
Sreyan Ghosh
S. Sakshi
Utkarsh Tyagi
Dinesh Manocha
CLIP
CoGe
VLM
85
1
0
30 Mar 2024
Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training
Tongkun Su
Jun Li
Xi Zhang
Haibo Jin
Hao Chen
Qiong Wang
Faqin Lv
Baoliang Zhao
Yin Hu
71
0
0
30 Mar 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM
LRM
102
40
0
28 Mar 2024
RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method
Ming Yan
Yan Zhang
Shuqiang Cai
Shuqi Fan
Xincheng Lin
...
Siqi Shen
Chenglu Wen
Lan Xu
Yuexin Ma
Cheng-Yu Wang
75
6
0
28 Mar 2024
FewUser: Few-Shot Social User Geolocation via Contrastive Learning
Menglin Li
Kwan Hui Lim
35
0
0
28 Mar 2024
Text Data-Centric Image Captioning with Interactive Prompts
Yiyu Wang
Hao Luo
Jungang Xu
Yingfei Sun
Fan Wang
VLM
76
0
0
28 Mar 2024
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee
Sanghyuk Chun
Sangdoo Yun
VLM
82
3
0
27 Mar 2024
The Solution for the CVPR 2023 1st foundation model challenge-Track2
Haonan Xu
Yurui Huang
Sishun Pan
Zhihao Guan
Yi Tian Xu
Yang Yang
57
0
0
26 Mar 2024
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
Yabin Zhang
Wen-Qing Zhu
Hui Tang
Zhiyuan Ma
Kaiyang Zhou
Lei Zhang
VLM
85
24
0
26 Mar 2024
Residual-based Language Models are Free Boosters for Biomedical Imaging
Zhixin Lai
Jing Wu
Suiyao Chen
Yucheng Zhou
N. Hovakimyan
MedIm
94
31
0
26 Mar 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yuchen Suo
Fan Ma
Linchao Zhu
Yi Yang
82
24
0
24 Mar 2024
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu
Weihao Ye
Yiyi Zhou
Xiaoshuai Sun
Rongrong Ji
MoE
84
1
0
22 Mar 2024
GTC: GNN-Transformer Co-contrastive Learning for Self-supervised Heterogeneous Graph Representation
Yundong Sun
Dongjie Zhu
Yansong Wang
Zhaoshuo Tian
ViT
SSL
78
18
0
22 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM
AI4TS
90
4
0
21 Mar 2024
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing
Ajian Liu
Shuai Xue
Jianwen Gan
Jun Wan
Yanyan Liang
Jiankang Deng
Sergio Escalera
Zhen Lei
VLM
73
27
0
21 Mar 2024
Visually Grounded Speech Models have a Mutual Exclusivity Bias
Leanne Nortje
Dan Oneaţă
Yevgen Matusevych
Herman Kamper
SSL
89
1
0
20 Mar 2024
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
Junho Kim
Yeonju Kim
Yonghyun Ro
LRM
MLLM
68
5
0
20 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
103
13
0
20 Mar 2024
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Sensen Gao
Xiaojun Jia
Xuhong Ren
Ivor Tsang
Qing Guo
AAML
101
19
0
19 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
82
11
0
18 Mar 2024
OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
131
0
0
18 Mar 2024
X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment
Dongjae Shin
Hyunseok Lim
Inho Won
Changsu Choi
Minjun Kim
Seungwoo Song
Hangyeol Yoo
Sangmin Kim
Kyungtae Lim
92
5
0
18 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
62
2
0
16 Mar 2024
Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction
Jiyuan Fu
Zhaoyu Chen
Kaixun Jiang
Haijing Guo
Jiafeng Wang
Shuyong Gao
Wenqiang Zhang
VLM
AAML
81
4
0
16 Mar 2024
Deciphering Hate: Identifying Hateful Memes and Their Targets
E. Hossain
Omar Sharif
M. M. Hoque
S. Preum
80
6
0
16 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
133
3
0
15 Mar 2024
Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
Zhuo Zhi
Ziquan Liu
M. Elbadawi
Adam Daneshmend
Mine Orlu
Abdul Basit
Andreas Demosthenous
Miguel R. D. Rodrigues
90
2
0
14 Mar 2024
Anatomical Structure-Guided Medical Vision-Language Pre-training
Qingqiu Li
Xiaohan Yan
Jilan Xu
Runtian Yuan
Yuejie Zhang
Rui Feng
Quanli Shen
Xiaobo Zhang
Shujun Wang
97
6
0
14 Mar 2024
An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
Yuxin Tian
Mouxing Yang
Yunfan Li
Dayiheng Liu
Xingzhang Ren
Xiaocui Peng
Jiancheng Lv
VLM
73
0
0
13 Mar 2024
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
Long Lan
Fengxiang Wang
Shuyan Li
Xiangtao Zheng
Zengmao Wang
Xinwang Liu
VLM
84
9
0
13 Mar 2024
REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence
Ruochen Zheng
Jiahao Hong
Changxin Gao
Nong Sang
75
1
0
13 Mar 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Haokun Lin
Haoli Bai
Zhili Liu
Lu Hou
Muyi Sun
Linqi Song
Ying Wei
Zhenan Sun
CLIP
VLM
94
17
0
12 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh-Son To
Johan Verjans
128
14
0
12 Mar 2024
VideoMamba: State Space Model for Efficient Video Understanding
Kunchang Li
Xinhao Li
Yi Wang
Yinan He
Yali Wang
Limin Wang
Yu Qiao
Mamba
67
214
0
11 Mar 2024
DiaLoc: An Iterative Approach to Embodied Dialog Localization
Chao Zhang
Mohan Li
Ignas Budvytis
Stephan Liwicki
81
2
0
11 Mar 2024
Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval
Hailang Huang
Zhijie Nie
Ziqiao Wang
Ziyu Shang
67
13
0
08 Mar 2024
MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder
Lei Li
Tianfang Zhang
Xinglin Zhang
Jiaqi Liu
Bingqi Ma
Yan-chun Luo
Tao Chen
MedIm
81
0
0
07 Mar 2024
Effectiveness Assessment of Recent Large Vision-Language Models
Yao Jiang
Xinyu Yan
Ge-Peng Ji
Keren Fu
Meijun Sun
Huan Xiong
Deng-Ping Fan
Fahad Shahbaz Khan
125
17
0
07 Mar 2024
Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation
Gang Liu
Hongyang Li
Zerui He
Shenjun Zhong
MedIm
40
1
0
05 Mar 2024
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Zhiyuan Chang
Mingyang Li
Junjie Wang
Cheng Li
Qing Wang
58
0
0
05 Mar 2024
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Iryna Hartsock
Ghulam Rasool
102
81
0
04 Mar 2024
Self-Supervised Facial Representation Learning with Facial Region Awareness
Zheng Gao
Ioannis Patras
SSL
93
11
0
04 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
140
3
0
04 Mar 2024
Previous
1
2
3
...
8
9
10
...
23
24
25
Next