Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2107.07651
Cited By
v1
v2 (latest)
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Weizhen He
Yihe Deng
Shixiang Tang
Qihao Chen
Qingsong Xie
...
Feng Zhu
Rui Zhao
Wanli Ouyang
Donglian Qi
Yunfeng Yan
121
24
0
13 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
96
4
0
12 Jun 2023
Sticker820K: Empowering Interactive Retrieval with Stickers
Sijie Zhao
Yixiao Ge
Zhongang Qi
Lin Song
Xiaohan Ding
Zehua Xie
Ying Shan
62
8
0
12 Jun 2023
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Li Xu
Bo Liu
Ameer Hamza Khan
Lu Fan
Xiao-Ming Wu
LM&MA
65
9
0
10 Jun 2023
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Fuxiao Liu
Hao Tan
Chris Tensmeyer
CLIP
VLM
99
18
0
09 Jun 2023
Read, look and detect: Bounding box annotation from image-caption pairs
E. Sanchez
ObjD
62
0
0
09 Jun 2023
Modular Visual Question Answering via Code Generation
Sanjay Subramanian
Medhini Narasimhan
Kushal Khangaonkar
Kevin Kaichuang Yang
Arsha Nagrani
Cordelia Schmid
Andy Zeng
Trevor Darrell
Dan Klein
75
51
0
08 Jun 2023
Fine-Grained Visual Prompting
Lingfeng Yang
Yueze Wang
Xiang Li
Xinlong Wang
Jian Yang
ObjD
VLM
115
68
0
07 Jun 2023
MolFM: A Multimodal Molecular Foundation Model
Yi Luo
Kai Yang
Massimo Hong
Xingyi Liu
Zaiqing Nie
78
39
0
06 Jun 2023
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
87
1
0
04 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip Torr
Volker Tresp
VPVLM
VLM
127
21
0
03 Jun 2023
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Qiangchang Wang
Yilong Yin
98
0
0
02 Jun 2023
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin
Xinyue Chen
Deepak Pathak
Pengchuan Zhang
Deva Ramanan
VLM
159
27
0
02 Jun 2023
Vocabulary-free Image Classification
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
129
27
0
01 Jun 2023
A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics
Hong-Yu Zhou
Yizhou Yu
Chengdi Wang
Shu Zhen Zhang
Yuanxu Gao
Jia Pan
Jun Shao
Guangming Lu
Kang Zhang
Weimin Li
MedIm
91
171
0
01 Jun 2023
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning
Xiao Dong
Runhu Huang
Xiaoyong Wei
Zequn Jie
Jianxing Yu
Jian Yin
Xiaodan Liang
VLM
DiffM
67
1
0
01 Jun 2023
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Shubin Huang
Qiong Wu
Yiyi Zhou
Weijie Chen
Rongsheng Zhang
Xiaoshuai Sun
Rongrong Ji
VLM
VPVLM
LRM
43
0
0
01 Jun 2023
PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Hejie Cui
Rongmei Lin
Nasser Zalmout
Chenwei Zhang
Jingbo Shang
Carl Yang
Xian Li
VLM
72
4
0
01 Jun 2023
Prompt Algebra for Task Composition
Pramuditha Perera
Matthew Trager
Luca Zancato
Alessandro Achille
Stefano Soatto
VLM
77
8
0
01 Jun 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu
Bei Li
Chenfei Wu
Shao-Yen Tseng
Anahita Bhiwandiwalla
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
AIFin
VLM
70
4
0
31 May 2023
Too Large; Data Reduction for Vision-Language Pre-Training
Alex Jinpeng Wang
Kevin Qinghong Lin
David Junhao Zhang
Stan Weixian Lei
Mike Zheng Shou
VLM
78
24
0
31 May 2023
Chatting Makes Perfect: Chat-based Image Retrieval
Matan Levy
Rami Ben-Ari
N. Darshan
Dani Lischinski
136
16
0
31 May 2023
Joint Adaptive Representations for Image-Language Learning
A. Piergiovanni
A. Angelova
VLM
76
0
0
31 May 2023
LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
M. Jehanzeb Mirza
Leonid Karlinsky
Wei Lin
Mateusz Koziñski
Horst Possegger
Rogerio Feris
Horst Bischof
VLM
107
33
0
29 May 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
195
112
0
29 May 2023
HGT: A Hierarchical GCN-Based Transformer for Multimodal Periprosthetic Joint Infection Diagnosis Using CT Images and Text
Ruiyang Li
Fujun Yang
Xianjie Liu
Hon-Yi Shi
75
0
0
29 May 2023
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions
Noam Rotstein
David Bensaid
Shaked Brody
Roy Ganz
Ron Kimmel
VLM
81
31
0
28 May 2023
KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models
Zhiwei Jia
P. Narayana
Arjun Reddy Akula
G. Pruthi
Haoran Su
Sugato Basu
Varun Jampani
VLM
OffRL
81
4
0
28 May 2023
Learning from Children: Improving Image-Caption Pretraining via Curriculum
Hammad A. Ayyubi
R. Lokesh
Alireza Zareian
Bohong Wu
Shih-Fu Chang
VLM
CLIP
62
2
0
27 May 2023
PuMer: Pruning and Merging Tokens for Efficient Vision Language Models
Qingqing Cao
Bhargavi Paranjape
Hannaneh Hajishirzi
MLLM
VLM
75
27
0
27 May 2023
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi
Chaofan Tao
Anyi Rao
Zhendong Yang
Chun Yuan
Jiaqi Wang
VLM
133
23
0
27 May 2023
MPCHAT: Towards Multimodal Persona-Grounded Conversation
Jaewoo Ahn
Yeda Song
Sangdoo Yun
Gunhee Kim
53
22
0
27 May 2023
Modularized Zero-shot VQA with Pre-trained Models
Rui Cao
Jing Jiang
LRM
89
3
0
27 May 2023
On Evaluating Adversarial Robustness of Large Vision-Language Models
Yunqing Zhao
Tianyu Pang
Chao Du
Xiao Yang
Chongxuan Li
Ngai-Man Cheung
Min Lin
VLM
AAML
MLLM
149
184
0
26 May 2023
CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning
Zhao-Heng Zheng
Haidong Zhu
Ramkant Nevatia
CoGe
91
7
0
26 May 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
57
4
0
26 May 2023
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Zheyuan Liu
Weixuan Sun
Damien Teney
Stephen Gould
92
19
0
25 May 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Chia-Wen Kuo
Z. Kira
83
23
0
25 May 2023
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Zijia Zhao
Longteng Guo
Tongtian Yue
Si-Qing Chen
Shuai Shao
Xinxin Zhu
Zehuan Yuan
Jing Liu
MLLM
111
61
0
25 May 2023
Weakly Supervised Vision-and-Language Pre-training with Relative Representations
Chi Chen
Peng Li
Maosong Sun
Yang Liu
67
2
0
24 May 2023
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
Zekun Wang
Jingchang Chen
Wangchunshu Zhou
Haichao Zhu
Jiafeng Liang
Liping Shan
Ming Liu
Dongliang Xu
Qing Yang
Bing Qin
VLM
87
5
0
24 May 2023
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi
Aishwarya Agrawal
57
6
0
24 May 2023
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Ahmed Masry
P. Kavehzadeh
Do Xuan Long
Enamul Hoque
Shafiq Joty
LRM
95
113
0
24 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
103
5
0
23 May 2023
Masked Path Modeling for Vision-and-Language Navigation
Zi-Yi Dou
Feng Gao
Nanyun Peng
LM&Ro
81
3
0
23 May 2023
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran
Yashin Dicente Cid
Amal Lahiani
Fabian J. Theis
Tingying Peng
Eldad Klaiman
54
2
0
23 May 2023
DetGPT: Detect What You Need via Reasoning
Renjie Pi
Jiahui Gao
Shizhe Diao
Boyao Wang
Hanze Dong
...
Lewei Yao
Jianhua Han
Hang Xu
Lingpeng Kong Tong Zhang
Tong Zhang
LRM
LM&Ro
86
99
0
23 May 2023
Can Language Models Understand Physical Concepts?
Lei Li
Jingjing Xu
Qingxiu Dong
Ce Zheng
Qi Liu
Lingpeng Kong
Xu Sun
ALM
61
22
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP
VLM
129
27
0
23 May 2023
UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
Hao Yang
Can Gao
Hao Liu
Xinyan Xiao
Yanyan Zhao
Bing Qin
42
2
0
23 May 2023
Previous
1
2
3
...
15
16
17
...
23
24
25
Next