2107.07651
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
ArXiv (abs)
PDF
HTML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation
Kang Liu
Zhuoqi Ma
Mengmeng Liu
Zhicheng Jiao
Xiaolu Kang
Qiguang Miao
Kun Xie
MedIm
83
1
0
15 May 2024
Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis
A. Englebert
Anne-Sophie Collin
O. Cornu
Christophe De Vleeschouwer
74
1
0
14 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
93
10
0
14 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
66
0
0
09 May 2024
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie
Yehui Tang
Ning Ding
Zhi-Hong Deng
Kai Han
Yunhe Wang
VLM
111
11
0
09 May 2024
Using Machine Translation to Augment Multilingual Classification
Adam King
83
0
0
09 May 2024
All in One Framework for Multimodal Re-identification in the Wild
He Li
Mang Ye
Ming Zhang
Bo Du
83
11
0
08 May 2024
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
Zheng Zhu
Xiaofeng Wang
Wangbo Zhao
Chen Min
Nianchen Deng
...
Dawei Zhao
Liang Xiao
Jian-jun Zhao
Jiwen Lu
Guan Huang
VGen
LM&Ro
174
48
0
06 May 2024
Knowledge-aware Text-Image Retrieval for Remote Sensing Images
Li Mi
Xianjie Dai
J. Castillo-Navarro
D. Tuia
62
5
0
06 May 2024
Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
Jiacheng Cheng
Hijung Valentina Shin
Nuno Vasconcelos
Bryan C. Russell
Fabian Caba Heilbron
VLM
60
1
0
06 May 2024
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
CLIP
VLM
175
23
0
30 Apr 2024
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi
Jana Kosecka
67
1
0
29 Apr 2024
TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation
Junhao Cheng
Baiqiao Yin
Kaixin Cai
Minbin Huang
Hanhui Li
...
Yue Li
Yifei Li
Yuhao Cheng
Yiqiang Yan
Xiaodan Liang
DiffM
MLLM
138
13
0
29 Apr 2024
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment
Tengjun Huang
117
0
0
28 Apr 2024
Multimodal Fusion on Low-quality Data: A Comprehensive Survey
Qingyang Zhang
Yake Wei
Zongbo Han
Huazhu Fu
Xi Peng
...
Qinghua Hu
Cai Xu
Jie Wen
Di Hu
Changqing Zhang
121
31
0
27 Apr 2024
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
103
0
0
27 Apr 2024
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
Xuri Ge
Songpei Xu
Fuhai Chen
Jie Wang
Guoxin Wang
Shan An
Joemon M. Jose
3DPC
110
12
0
26 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Philip Torr
Wei Liu
Zhifeng Li
127
13
0
25 Apr 2024
Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models
Jiawei Chen
Dingkang Yang
Yue Jiang
Mingcheng Li
Jinjie Wei
Xiaolu Hou
Lihua Zhang
111
6
0
25 Apr 2024
VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
VLM
CoGe
106
0
0
25 Apr 2024
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Eric Slyman
Stefan Lee
Scott D. Cohen
Kushal Kafle
VLM
59
5
0
24 Apr 2024
Leveraging Large Language Models for Multimodal Search
Oriol Barbany
Michael Huang
Xinliang Zhu
Arnab Dhua
92
10
0
24 Apr 2024
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
Young Kyun Jang
Donghyun Kim
Zihang Meng
Dat Huynh
Ser-Nam Lim
79
12
0
23 Apr 2024
Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation
Yikun Zhang
Geyan Ye
Chaohao Yuan
Bo Han
Long-Kai Huang
Jianhua Yao
Wei Liu
Yu Rong
151
3
0
23 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
117
2
0
22 Apr 2024
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
Shouwei Ruan
Yinpeng Dong
Hanqing Liu
Yao Huang
Hang Su
Xingxing Wei
VLM
103
1
0
18 Apr 2024
Functional Protein Design with Local Domain Alignment
Chaohao Yuan
Songyou Li
Geyan Ye
Yikun Zhang
Long-Kai Huang
Wenbing Huang
Wei Liu
Jianhua Yao
Yu Rong
91
1
0
18 Apr 2024
Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation
Qiyuan Dai
Sibei Yang
86
9
0
18 Apr 2024
The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models
Cheng Shi
Sibei Yang
VLM
96
4
0
18 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
153
7
0
18 Apr 2024
Vocabulary-free Image Classification and Semantic Segmentation
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
87
3
0
16 Apr 2024
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Zaid Khan
Yun Fu
AAML
83
10
0
16 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
122
8
0
16 Apr 2024
Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels
Amaya Dharmasiri
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
VLM
3DPC
63
1
0
15 Apr 2024
Bridging Vision and Language Spaces with Assignment Prediction
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
VLM
97
7
0
15 Apr 2024
Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder
Chong Peng
Liqiang He
Dan Su
CVBM
107
0
0
15 Apr 2024
Probing the 3D Awareness of Visual Foundation Models
Mohamed El Banani
Amit Raj
Kevis-Kokitsi Maninis
Abhishek Kar
Yuanzhen Li
Michael Rubinstein
Deqing Sun
Leonidas Guibas
Justin Johnson
Varun Jampani
101
86
0
12 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
112
5
0
11 Apr 2024
How is Visual Attention Influenced by Text Guidance? Database and Model
Yinan Sun
Xiongkuo Min
Huiyu Duan
Guangtao Zhai
167
4
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
80
32
0
10 Apr 2024
Unified Language-driven Zero-shot Domain Adaptation
Senqiao Yang
Zhuotao Tian
Li Jiang
Jiaya Jia
95
10
0
10 Apr 2024
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina
Massimiliano Mancini
Elia Cunegatti
Gaowen Liu
Giovanni Iacca
Elisa Ricci
VLM
79
2
0
08 Apr 2024
Progressive Alignment with VLM-LLM Feature to Augment Defect Classification for the ASE Dataset
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
37
0
0
08 Apr 2024
Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models
Weiwei Cao
Jianpeng Zhang
Yingda Xia
Tony C. W. Mok
Zi Li
X. Ye
Le Lu
Jian Zheng
Yuxing Tang
Ling Zhang
58
4
0
07 Apr 2024
Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving
Jinlong Li
Baolu Li
Zhengzhong Tu
Xinyu Liu
Qing Guo
Felix Juefei Xu
Runsheng Xu
Hongkai Yu
DiffM
123
26
0
07 Apr 2024
To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
Zi-Hao Qiu
Siqi Guo
Mao Xu
Tuo Zhao
Lijun Zhang
Tianbao Yang
AI4TS
AI4CE
117
4
0
06 Apr 2024
Label Propagation for Zero-shot Classification with Vision-Language Models
Vladan Stojnić
Yannis Kalantidis
Giorgos Tolias
VLM
71
9
0
05 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
132
28
0
04 Apr 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Vishaal Udandarao
Ameya Prabhu
Adhiraj Ghosh
Yash Sharma
Philip Torr
Adel Bibi
Samuel Albanie
Matthias Bethge
VLM
220
55
0
04 Apr 2024
DeViDe: Faceted medical knowledge for improved medical vision-language pre-training
Haozhe Luo
Ziyu Zhou
Corentin Royer
Anjany Sekuboyina
Bjoern Menze
VLM
ViT
MedIm
101
7
0
04 Apr 2024