2107.07651
Cited By
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Github (1658★)
Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
Improving Social Media Popularity Prediction with Multiple Post Dependencies
Zhizhen Zhang
Xiao-Zhu Xie
Meng Yang
Ye Tian
Yong Jiang
Yong Cui
57
5
0
28 Jul 2023
BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering
Khiem Vinh Tran
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
ViT
71
2
0
28 Jul 2023
MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities
Yongqian Li
Tingwei Lu
Hai-Tao Zheng
Tianyu Yu
Shulin Huang
Haitao Zheng
Rui Zhang
Jun Yuan
95
11
0
27 Jul 2023
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
Hongxiang Li
Meng Cao
Xuxin Cheng
Yaowei Li
Zhihong Zhu
Yuexian Zou
114
20
0
26 Jul 2023
Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models
Dong Lu
Zhiqiang Wang
Teng Wang
Weili Guan
Hongchang Gao
Feng Zheng
AAML
121
76
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
146
127
0
25 Jul 2023
Spectrum-guided Multi-granularity Referring Video Object Segmentation
Bo Miao
Bennamoun
Yongsheng Gao
Ajmal Mian
VOS
97
41
0
25 Jul 2023
Towards a Visual-Language Foundation Model for Computational Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Ivy Liang
...
Andrew Zhang
L. Le
Georg Gerber
Anil V. Parwani
Faisal Mahmood
VLM
MedIm
110
46
0
24 Jul 2023
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model
Peng Wu
Jing Liu
Xiangteng He
Yuxin Peng
Peng Wang
Yanning Zhang
124
34
0
24 Jul 2023
Improving Viewpoint Robustness for Visual Recognition via Adversarial Training
Shouwei Ruan
Yinpeng Dong
Han Su
Jianteng Peng
Ning Chen
Xingxing Wei
60
7
0
21 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
Yiyuan Zhang
Kaixiong Gong
Kaipeng Zhang
Hongsheng Li
Yu Qiao
Wanli Ouyang
Xiangyu Yue
105
150
0
20 Jul 2023
Multi-Grained Multimodal Interaction Network for Entity Linking
Pengfei Luo
Tong Xu
Shiwei Wu
Chen Zhu
Linli Xu
Enhong Chen
87
11
0
19 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
140
40
0
18 Jul 2023
R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut
Yingjie Niu
Ming Ding
Maoning Ge
Robin Karlsson
Yuxiao Zhang
K. Takeda
ViT
50
3
0
18 Jul 2023
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
63
9
0
17 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
92
4
0
15 Jul 2023
PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
Zixin Guo
Tong Wang
Selen Pehlivan
Abduljalil Radman
Jorma T. Laaksonen
VLM
62
2
0
14 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
216
1
0
14 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
98
31
0
13 Jul 2023
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle
Abhay Jain
Radu Timofte
Goran Glavaš
VLM
MLLM
123
32
0
13 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
118
100
0
11 Jul 2023
Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
Pengfei Li
Gang Liu
Jinlong He
Zixu Zhao
Shenjun Zhong
53
37
0
11 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
119
138
0
11 Jul 2023
Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback
Jaskirat Singh
Liang Zheng
107
19
0
10 Jul 2023
Enhancing Cross-lingual Transfer via Phonemic Transcription Integration
Hoang Nguyen
Chenwei Zhang
Tao Zhang
Eugene Rohrbaugh
Philip S. Yu
78
7
0
10 Jul 2023
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models
Wei Han
Hui Chen
MingSung Kan
Soujanya Poria
96
1
0
09 Jul 2023
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Chunhui Zhang
Xin Sun
Li Liu
Yiqian Yang
Qiong Liu
Xiaoping Zhou
Yanfeng Wang
218
17
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
53
5
0
06 Jul 2023
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Xuanlin Li
Yunhao Fang
Minghua Liu
Z. Ling
Zhuowen Tu
Haoran Su
VLM
95
25
0
06 Jul 2023
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Pratyush Maini
Sachin Goyal
Zachary Chase Lipton
J. Zico Kolter
Aditi Raghunathan
VLM
83
35
0
06 Jul 2023
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng
Hanbo Zhang
Jiani Zheng
Jiangnan Xia
Guoqiang Wei
Yang Wei
Yuchen Zhang
Tao Kong
MLLM
106
79
0
05 Jul 2023
Review of Large Vision Models and Visual Prompt Engineering
Jiaqi Wang
Zheng Liu
Lin Zhao
Zihao Wu
Chong Ma
...
Bao Ge
Yixuan Yuan
Dinggang Shen
Tianming Liu
Shu Zhang
VLM
LRM
155
162
0
03 Jul 2023
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
Yasmine Karoui
R. Lebret
Negar Foroutan
Karl Aberer
MLLM
VLM
57
2
0
29 Jun 2023
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
Avinash Madasu
Vasudev Lal
CoGe
102
3
0
28 Jun 2023
Towards Open Vocabulary Learning: A Survey
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD
VLM
154
151
0
28 Jun 2023
ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration
Jiaqi Ma
Tianheng Cheng
Guoli Wang
Qian Zhang
Xinggang Wang
Lefei Zhang
DiffM
VLM
81
48
0
23 Jun 2023
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM
LRM
138
611
0
23 Jun 2023
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Simone Bianco
Luigi Celona
Marco Donzella
Paolo Napoletano
75
20
0
20 Jun 2023
Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang
Kaipeng Zhang
Yuying Ge
Wenqi Shao
Zeyue Xue
Yu Qiao
Ping Luo
DiffM
104
6
0
20 Jun 2023
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Yuxiang Cai
DiffM
VLM
146
67
0
20 Jun 2023
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
Fan Liu
Delong Chen
Zhan-Rong Guan
Xiaocong Zhou
Jiale Zhu
Qiaolin Ye
Liyong Fu
Jun Zhou
VLM
170
224
0
19 Jun 2023
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Chun-Hsiao Yeh
Bryan C. Russell
Josef Sivic
Fabian Caba Heilbron
Simon Jenni
VLM
MLLM
101
11
0
16 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM
CLIP
83
9
0
15 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
61
13
0
15 Jun 2023
Pragmatic Inference with a CLIP Listener for Contrastive Captioning
Jiefu Ou
Benno Krojer
Daniel Fried
91
6
0
15 Jun 2023
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training
Chong Liu
Yuqi Zhang
Hongsong Wang
Weihua Chen
F. Wang
Yan Huang
Yixing Shen
Liang Wang
73
28
0
15 Jun 2023
ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
Kelly O. Marshall
Minh Pham
Ameya Joshi
Anushrut Jignasu
Aditya Balu
Adarsh Krishnamurthy
A. Hegde
CLIP
60
3
0
14 Jun 2023
Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training
Alyssa Huang
Peihan Liu
Ryumei Nakada
Linjun Zhang
Wanrong Zhang
VLM
141
6
0
13 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
112
60
0
13 Jun 2023