Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2107.07651
Cited By
v1
v2 (latest)
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
Andrei-Alexandru Manea
Jindřich Libovický
VLM
123
0
0
30 Apr 2025
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Shahad Albastaki
Anabia Sohail
I. I. Ganapathi
B. Alawode
Asim Khan
Sajid Javed
Naoufel Werghi
Mohammed Bennamoun
Arif Mahmood
174
0
0
26 Apr 2025
Multimodal graph representation learning for website generation based on visual sketch
Tung D. Vu
Chung Hoang
Truong-Son Hy
3DV
100
0
0
25 Apr 2025
Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation
Zhuang Yu
Shiliang Sun
Jing Zhao
Tengfei Song
Hao Yang
91
0
0
25 Apr 2025
PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
Kai Cui
Jiajian Li
Yang Liu
Xuesong Zhang
Zhenzhen Hu
Ming Wang
82
0
0
24 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Jianfei Chen
Xiaoye Zhu
Yanjie Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
Qingbin Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
78
0
0
24 Apr 2025
Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis
Zichuan Liu
Liming Jiang
Qing Yan
Yumin Jia
Hao Kang
Xin Lu
DiffM
142
0
0
19 Apr 2025
GFT: Gradient Focal Transformer
Boris Kriuk
Simranjit Kaur Gill
Shoaib Aslam
Amir Fakhrutdinov
94
0
0
14 Apr 2025
UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
Yating Liu
Yaowei Li
Xiangyuan Lan
Wenming Yang
Zimo Liu
Q. Liao
86
0
0
14 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Hao Yin
Gunagzong Si
Zilei Wang
494
0
0
14 Apr 2025
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Yongchao Feng
Yajie Liu
Shuai Yang
Wenrui Cai
Jing Zhang
...
Jiahui Lv
Ziqiang Liu
Tengyuan Shi
Qingjie Liu
Yansen Wang
MLLM
VLM
121
2
0
13 Apr 2025
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
Jian Wu
Hao Yang
Xinhua Zeng
Guibing He
Zhe Chen
Zhu Li
Xinming Zhang
Yangyang Ma
Run Fang
Yang Liu
LRM
385
1
0
12 Apr 2025
Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction
Kyoyun Choi
Byungmu Yoon
Soobum Kim
Jonggwon Park
137
1
0
10 Apr 2025
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang
Xuhui Zheng
X. Wu
Chong Peng
Xuezhi Cao
71
2
0
10 Apr 2025
Pose-Aware Weakly-Supervised Action Segmentation
Seth Z. Zhao
Reza Ghoddoosian
Isht Dwivedi
Nakul Agarwal
Behzad Dariush
118
0
0
08 Apr 2025
COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
Chunhui Zhang
Li Liu
Jialin Gao
Xin Sun
Hao Wen
Xi Zhou
Shiming Ge
Yucheng Wang
110
1
0
02 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
Jing Liu
Peijie Wang
Jing Tao
Zhou Su
124
2
0
01 Apr 2025
Spingarn's Method and Progressive Decoupling Beyond Elicitable Monotonicity
B. Evens
P. Latafat
Panagiotis Patrinos
229
1
0
01 Apr 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
Slava Voloshynovskiy
DiffM
107
0
0
31 Mar 2025
ViLAaD: Enhancing "Attracting and Dispersing'' Source-Free Domain Adaptation with Vision-and-Language Model
Shuhei Tarashima
Xinqi Shu
Norio Tagawa
VLM
79
0
0
30 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
179
1
0
29 Mar 2025
Efficient Adaptation For Remote Sensing Visual Grounding
Hasan Moughnieh
Mohamad Chalhoub
Hasan Nasrallah
Cristiano Nattero
Paolo Campanella
Giovanni Nico
A. Ghandour
109
0
0
29 Mar 2025
Model Assembly Learning with Heterogeneous Layer Weight Merging
Yi-Kai Zhang
Jin Wang
Xu-Xiang Zhong
De-Chuan Zhan
Han-Jia Ye
MoMe
103
0
0
27 Mar 2025
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Haoqiang Lin
Haokun Wen
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie
185
16
0
25 Mar 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Shuo Yang
Siwen Luo
S. Han
Eduard Hovy
LRM
64
6
0
24 Mar 2025
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
Suho Yoo
Hyunjong Ok
Jaeho Lee
AuLLM
RALM
103
0
0
21 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
95
1
0
21 Mar 2025
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Zichen Liu
Kunlun Xu
Fuchun Sun
Xu Zou
Yuxin Peng
Jiahuan Zhou
VLM
AI4TS
193
2
0
20 Mar 2025
Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Yuchen Ren
Zhengyu Zhao
Chenhao Lin
Bo Yang
Zhe Liu
Jiafei Wu
Chao Shen
ViT
92
2
0
19 Mar 2025
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Hao Yin
Guangzong Si
Zilei Wang
132
1
0
17 Mar 2025
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing
Zilun Zhang
Haozhan Shen
Tiancheng Zhao
Bin Chen
Zian Guan
Yuhao Wang
Xu Jia
Yuhao Wang
Yongheng Shang
Yuxiang Cai
88
0
0
16 Mar 2025
Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation
Yifan Xie
Binkai Ou
Fei Ma
Yaohua Liu
70
0
0
14 Mar 2025
The Power of One: A Single Example is All it Takes for Segmentation in VLMs
Mir Rayat Imtiaz Hossain
Mennatullah Siam
Leonid Sigal
James J. Little
MLLM
VLM
Presented at
ResearchTrend Connect | VLM
on
21 May 2025
230
0
0
13 Mar 2025
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Jiayu Jiang
Changxing Ding
Wentao Tan
Junhong Wang
Jin Tao
Xiangmin Xu
119
1
0
13 Mar 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
Pengfei Luo
Jingbo Zhou
Tong Xu
Yuan Xia
Linli Xu
Enhong Chen
LRM
151
0
0
13 Mar 2025
Aligning Text to Image in Diffusion Models is Easier Than You Think
J. Lee
Byunghee Cha
Jeongsol Kim
Jong Chul Ye
155
1
0
11 Mar 2025
Debiased Prompt Tuning in Vision-Language Model without Annotations
Chaoquan Jiang
Yunfan Yang
Rui Hu
Jitao Sang
VLM
93
0
0
11 Mar 2025
Decoupled Cross-Modal Alignment Network for Text-RGBT Person Retrieval and A High-Quality Benchmark
Yifei Deng
Zhengyu Chen
Ziheng Xu
Chenglong Li
Jin Tang
84
0
0
11 Mar 2025
Anatomy-Aware Conditional Image-Text Retrieval
Meng Zheng
Jiajin Zhang
Benjamin Planche
Zhongpai Gao
Terrence Chen
Ziyan Wu
MedIm
87
0
0
10 Mar 2025
Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency
Duy Phuong Nguyen
J. P. Muñoz
Tanya Roosta
Ali Jannesari
FedML
102
0
0
10 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
116
0
0
10 Mar 2025
ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis
Xukun Zhou
Fengxin Li
Ming Chen
Yan Zhou
Pengfei Wan
Di Zhang
Yeying Jin
Zhaoxin Fan
Hongyan Liu
Jun He
DiffM
VGen
97
0
0
09 Mar 2025
AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
Wenxin Ma
Xu Zhang
Qingsong Yao
Fenghe Tang
Chenxu Wu
Yongbin Li
Rui Yan
Zihang Jiang
S. Kevin Zhou
VLM
111
3
0
09 Mar 2025
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
Md Azim Khan
A. Gangopadhyay
Jianwu Wang
Robert F. Erbacher
VLM
76
0
0
08 Mar 2025
Data-Efficient Generalization for Zero-shot Composed Image Retrieval
Zining Chen
Zhicheng Zhao
Fei Su
Xiaoqin Zhang
Shijian Lu
VLM
148
0
0
07 Mar 2025
RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems
Biao Ouyang
Yingying Zhang
Hanyin Cheng
Yang Shu
Chenjuan Guo
Bin Yang
Qingsong Wen
L. Fan
Christian S. Jensen
98
1
0
06 Mar 2025
Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Jie Xu
Na Zhao
Gang Niu
Masashi Sugiyama
Xiaofeng Zhu
159
0
0
06 Mar 2025
Learning 3D Medical Image Models From Brain Functional Connectivity Network Supervision For Mental Disorder Diagnosis
Xingcan Hu
Wei Wang
Li Xiao
72
0
0
06 Mar 2025
A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery
Yiheng Zhu
Mingyang Li
Junlong Liu
Kun Fu
Jian Wu
Yue Liu
Mingze Yin
Jieping Ye
Jian Wu
Zehua Wang
143
0
0
06 Mar 2025
BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA
Zhengyang Ji
Shang Gao
Li Liu
Yifan Jia
Yutao Yue
58
0
0
04 Mar 2025
Previous
1
2
3
4
5
...
23
24
25
Next