2107.07651
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA
Zhengyang Ji
Shang Gao
Li Liu
Yifan Jia
Yutao Yue
58
0
0
04 Mar 2025
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Ziyang Zhang
Yang Yu
Yucheng Chen
Xulei Yang
S. Yeo
MedIm
179
2
0
02 Mar 2025
IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis
Yun Wang
Jingchen Ni
Yong-Jin Liu
Chun Yuan
Yansong Tang
96
4
0
02 Mar 2025
Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series
Yanan Niu
Roy Sarkis
D. Psaltis
Mario Paolone
Christophe Moser
Luisa Lambertini
131
0
0
28 Feb 2025
MICINet: Multi-Level Inter-Class Confusing Information Removal for Reliable Multimodal Classification
Tianze Zhang
Shu Shen
Chao Chen
116
0
0
27 Feb 2025
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Aayush Dhakal
Srikumar Sastry
Subash Khanal
Adeel Ahmad
Eric Xing
Nathan Jacobs
152
0
0
27 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Chenyang Zhao
Kun Wang
J. H. Hsiao
Antoni B. Chan
CLIP
108
0
0
26 Feb 2025
Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models
Jiawei Kong
Hao Fang
Sihang Guo
Chenxi Qing
Bin Chen
Bin Wang
Shu-Tao Xia
AAML
VLM
132
0
0
26 Feb 2025
Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation
Wenxuan Wang
K. Wu
Yujian Betterest Li
Dan Wang
Xinsong Zhang
Qingbin Liu
AI4TS
115
1
0
24 Feb 2025
CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification
Liping Lu
Zihao Fu
Duanfeng Chu
Wei Wang
Bingrong Xu
VLM
102
0
0
24 Feb 2025
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin
Zehao Xiao
Pan Zhou
Shujian Yu
Jiayi Shen
Jan-Jakob Sonke
E. Gavves
177
1
0
24 Feb 2025
Graph Perceiver IO: A General Architecture for Graph Structured Data
Seyun Bae
Hoyoon Byun
Changdae Oh
Yoon-Sik Cho
Kyungwoo Song
GNN
256
3
0
24 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
110
16
0
24 Feb 2025
Understanding the Emergence of Multimodal Representation Alignment
Megan Tjandrasuwita
Chanakya Ekbote
Liu Ziyin
Paul Pu Liang
108
2
0
22 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
529
0
0
21 Feb 2025
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
Yuheng Ji
Yue Liu
Zhicheng Zhang
Zhao Zhang
Yuting Zhao
Gang Zhou
Xingwei Zhang
Xinwang Liu
Xiaolong Zheng
VLM
184
4
0
21 Feb 2025
Learning Generalizable Prompt for CLIP with Class Similarity Knowledge
Sehun Jung
Hyang-won Lee
VLM
VPVLM
73
0
0
17 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
97
0
0
09 Feb 2025
Boosting Weak Positives for Text Based Person Search
Akshay Modi
Ashhar Aziz
Nilanjana Chatterjee
A V Subramanyam
142
0
0
29 Jan 2025
sDREAMER: Self-distilled Mixture-of-Modality-Experts Transformer for Automatic Sleep Staging
Jingyuan Chen
Yuan Yao
Mie Anderson
Natalie Hauglund
Celia Kjaerby
Verena Untiet
Maiken Nedergaard
Jiebo Luo
153
2
0
28 Jan 2025
Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap
Srivatsa Mallapragada
Ying Xie
Varsha Rani Chawan
Zeyad Hailat
Yuanbo Wang
108
0
0
28 Jan 2025
BiFold: Bimanual Cloth Folding with Language Guidance
Oriol Barbany
Adrià Colomé
Carme Torras
42
1
0
27 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
83
0
0
20 Jan 2025
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
J. Park
Jungbeom Lee
Jongyoon Song
Sangwon Yu
Dahuin Jung
Sungroh Yoon
122
3
0
19 Jan 2025
LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
Pengcheng Zhao
Zhixian He
Fuwei Zhang
Shujin Lin
Fan Zhou
136
2
0
18 Jan 2025
A Resource-Efficient Training Framework for Remote Sensing Text-Image Retrieval
Weihang Zhang
Jihao Li
Shuoke Li
Ziqing Niu
Jialiang Chen
Wenkai Zhang
VLM
78
0
0
18 Jan 2025
Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data
Omar Mena
Alexandre Kouyoumdjian
Lonni Besancon
Michael Gleicher
I. Viola
Anders Ynnerman
93
0
0
17 Jan 2025
MULTI: Multimodal Understanding Leaderboard with Text and Images
Zichen Zhu
Yang Xu
Lu Chen
Jingkai Yang
Yichuan Ma
...
Yingzi Ma
Situo Zhang
Zihan Zhao
Liangtai Sun
Kai Yu
VLM
116
5
0
08 Jan 2025
Foundations of GenIR
Qingyao Ai
Jingtao Zhan
Yang Liu
126
0
0
06 Jan 2025
GeAR: Generation Augmented Retrieval
Haoyu Liu
Shaohan Huang
Jianfeng Liu
Yuefeng Zhan
H. Sun
Weiwei Deng
Feng Sun
Furu Wei
Qi Zhang
84
1
0
06 Jan 2025
Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
Hebin Wang
Yangning Li
Hai-Tao Zheng
Wenhao Jiang
Hong-Gee Kim
143
0
0
03 Jan 2025
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
145
30
0
31 Dec 2024
Enhancing Visual Representation for Text-based Person Searching
Wei Shen
Ming Fang
Yuxia Wang
Jiafeng Xiao
Diping Li
Ningyu Zhang
Ling Xu
Weinan Zhang
111
4
0
31 Dec 2024
M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios
Ning Liao
Xiaopeng Zhang
Minglu Cao
Junchi Yan
VPVLM
VLM
183
0
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
282
5
0
31 Dec 2024
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Yi Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
141
2
0
25 Dec 2024
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding
Zhuo Cao
Bingqing Zhang
Heming Du
Xin Yu
Xue Li
Sen Wang
125
2
0
18 Dec 2024
Bringing Multimodality to Amazon Visual Search System
Xinliang Zhu
Michael Huang
Han Ding
Jinyu Yang
Kelvin Chen
...
Son Dinh Tran
Benjamin Z. Yao
Doug Gray
Anuj Bindal
Arnab Dhua
112
3
0
17 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
143
11
0
17 Dec 2024
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang
Tang Li
Kien X. Nguyen
Xi Peng
182
0
0
17 Dec 2024
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
S. Nagendra
Kashif Rashid
Chaopeng Shen
Daniel Kifer
VLM
143
2
0
16 Dec 2024
Does VLM Classification Benefit from LLM Description Semantics?
Pingchuan Ma
Lennart Rietdorf
Dmytro Kotovenko
Vincent Tao Hu
Bjorn Ommer
VLM
148
1
0
16 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
204
4
0
16 Dec 2024
ViSymRe: Vision-guided Multimodal Symbolic Regression
Da Li
Junping Yin
Jin Xu
Xinxin Li
Juan Zhang
130
1
0
15 Dec 2024
AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs
Gorden Liu
Yu Sun
R.-H. Sun
Xin Dong
Hongyu Xiong
LLMAG
125
1
0
15 Dec 2024
Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
Yang Yang
Wenjuan Xi
Luping Zhou
Jinhui Tang
148
0
0
14 Dec 2024
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Haoyu Jiang
Zhi-Qi Cheng
Gabriel Moreira
Jiawen Zhu
Jingdong Sun
Bukun Ren
Jun-Yan He
Qi Dai
Xian-Sheng Hua
VLM
142
0
0
14 Dec 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
353
10
0
11 Dec 2024
Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning
Can Yaras
Siyi Chen
Peng Wang
Q. Qu
VLM
89
3
0
10 Dec 2024
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu
Mengyang Wu
Yuzhi Zhao
Jason Chun Lok Li
Weifeng Ou
LRM
SyDa
VLM
129
4
0
09 Dec 2024