2107.07651
Cited By
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory
Ting Lei
Fabian Caba
Qingchao Chen
Hailin Jin
Yuxin Peng
Yang Liu
VLM
101
19
0
07 Sep 2023
Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices
Bojia Zi
Xianbiao Qi
Lingzhi Wang
Jianan Wang
Kam-Fai Wong
Lei Zhang
96
47
0
05 Sep 2023
Dual Relation Alignment for Composed Image Retrieval
Xintong Jiang
Yaxiong Wang
Yujiao Wu
Ming Wang
Xueming Qian
50
6
0
05 Sep 2023
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
Taehoon Kim
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Mark A Marsden
...
Yujin Wang
Yimu Wang
Tiancheng Gu
Xingchang Lv
Mingmao Sun
VLM
132
6
0
05 Sep 2023
MultiWay-Adapater: Adapting large-scale multi-modal models for scalable image-text retrieval
Zijun Long
George Killick
R. McCreadie
Gerardo Aragon Camarasa
70
2
0
04 Sep 2023
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Qiong Wu
Wei Yu
Yiyi Zhou
Shubin Huang
Xiaoshuai Sun
Rongrong Ji
VLM
86
7
0
04 Sep 2023
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
Cheng Shi
Sibei Yang
VLM
90
21
0
03 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
75
5
0
30 Aug 2023
CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
75
21
0
28 Aug 2023
FIRE: Food Image to REcipe generation
P. Chhikara
Dhiraj Chaurasia
Yifan Jiang
Omkar Masur
Filip Ilievski
81
23
0
28 Aug 2023
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
Haiwen Diao
Bo Wan
Yanzhe Zhang
Xuecong Jia
Huchuan Lu
Long Chen
VLM
81
19
0
28 Aug 2023
Unified and Dynamic Graph for Temporal Character Grouping in Long Videos
Xiujun Shu
Wei Wen
Liangsheng Xu
Ruizhi Qiao
Taian Guo
Hanjun Li
Bei Gan
Tianlin Li
Xing Sun
129
0
0
27 Aug 2023
A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions
Tianyi Zhang
Zheng Wang
Jin Huang
M. M. Tasnim
Wei Shi
VLM
83
22
0
25 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
196
945
0
24 Aug 2023
DLIP: Distilling Language-Image Pre-training
Huafeng Kuang
Jie Wu
Xiawu Zheng
Ming Li
Xuefeng Xiao
Rui Wang
Min Zheng
Rongrong Ji
VLM
70
4
0
24 Aug 2023
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
Ziyan Yang
Kushal Kafle
Zhe Lin
Scott D. Cohen
Zhihong Ding
Vicente Ordonez
75
1
0
24 Aug 2023
Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval
Yuan Yuan
Yangfan Zhan
Zhitong Xiong
VLM
87
47
0
24 Aug 2023
Trustworthy Representation Learning Across Domains
Ronghang Zhu
Dongliang Guo
Daiqing Qi
Zhixuan Chu
Xiang Yu
Sheng Li
FaML
AI4TS
98
2
0
23 Aug 2023
Progressive Feature Mining and External Knowledge-Assisted Text-Pedestrian Image Retrieval
Huafeng Li
Shedan Yang
Yafei Zhang
Dapeng Tao
Z. Yu
76
3
0
23 Aug 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM
VLM
MoE
73
10
0
23 Aug 2023
Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition
Qitong Wang
Long Zhao
Liangzhe Yuan
Ting Liu
Xi Peng
120
16
0
22 Aug 2023
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization
Tao Chen
Zexiong Lin
Hui Li
Jiayi Ji
Yiyi Zhou
Guanbin Li
Rongrong Ji
61
0
0
22 Aug 2023
Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models
Zengxiang Li
Zhaoxiang Hou
Hui Liu
Ying Wang
Tongzhi Li
...
Chao Shi
Che-Sheng Yang
Weishan Zhang
Zelei Liu
Liang Xu
FedML
49
2
0
22 Aug 2023
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Haokun Chen
Yao Zhang
Denis Krompass
Jindong Gu
Volker Tresp
FedML
114
54
0
21 Aug 2023
An Examination of the Compositionality of Large Generative Vision-Language Models
Teli Ma
Rong Li
Junwei Liang
CoGe
79
4
0
21 Aug 2023
Simple Baselines for Interactive Video Retrieval with Questions and Answers
Kaiqu Liang
Samuel Albanie
74
3
0
21 Aug 2023
Generic Attention-model Explainability by Weighted Relevance Accumulation
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
46
1
0
20 Aug 2023
An Empirical Study of CLIP for Text-based Person Search
Min Cao
Yang Bai
Ziyin Zeng
Mang Ye
Min Zhang
VLM
124
47
0
19 Aug 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
111
12
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
74
1
0
18 Aug 2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan
Shiwei Zhang
Xiang Wang
Samuel Albanie
Yining Pan
Tao Feng
Jianwen Jiang
Dong Ni
Yingya Zhang
Deli Zhao
VLM
79
40
0
18 Aug 2023
Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes
Zehan Wang
Haifeng Huang
Yang Zhao
Ziang Zhang
Zhou Zhao
118
73
0
17 Aug 2023
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Kaicheng Yang
Jiankang Deng
Xiang An
Jiawei Li
Ziyong Feng
Jia Guo
Jing Yang
Tongliang Liu
VLM
CLIP
87
52
0
16 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
119
19
0
16 Aug 2023
Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
Rui Cao
Ming Shan Hee
Adriel Kuek
Wen-Haw Chong
Roy Ka-wei Lee
Jing Jiang
VLM
MLLM
56
43
0
16 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
72
23
0
15 Aug 2023
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Hongguang Zhu
Yunchao Wei
Xiaodan Liang
Chunjie Zhang
Yao-Min Zhao
VLM
72
30
0
14 Aug 2023
AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
Ziqi Zhou
Shengshan Hu
Minghui Li
Hangtao Zhang
Yechao Zhang
Hai Jin
AAML
127
75
0
14 Aug 2023
Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation
Md Golam Moula Mehedi Hasan
Nasser M. Nasrabadi
CVBM
47
2
0
13 Aug 2023
Taming Self-Training for Open-Vocabulary Object Detection
Shiyu Zhao
S. Schulter
Long Zhao
Zhixing Zhang
Vijay Kumar B.G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
106
12
0
11 Aug 2023
Foundation Model is Efficient Multimodal Multitask Model Selector
Fanqing Meng
Wenqi Shao
Zhanglin Peng
Chong Jiang
Kaipeng Zhang
Yu Qiao
Ping Luo
67
16
0
11 Aug 2023
MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation
Kaixin Cai
Pengzhen Ren
Yi Zhu
Hang Xu
Jian-zhuo Liu
Changlin Li
Guangrun Wang
Xiaodan Liang
VLM
81
15
0
09 Aug 2023
Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods
Ya Jing
Xuelin Zhu
Xingbin Liu
Qie Sima
Taozheng Yang
Yunhai Feng
Tao Kong
LM&Ro
76
16
0
07 Aug 2023
Exploring Part-Informed Visual-Language Learning for Person Re-Identification
Y. Lin
Cong Liu
Yehansen Chen
Jinshui Hu
Bing Yin
Baocai Yin
Zengfu Wang
179
7
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
133
88
0
03 Aug 2023
UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin Qinghong Lin
Pengchuan Zhang
Joya Chen
Shraman Pramanick
Difei Gao
Alex Jinpeng Wang
Rui Yan
Mike Zheng Shou
104
123
0
31 Jul 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
58
2
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
126
46
0
30 Jul 2023
Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
Aayush Dhakal
Adeel Ahmad
Subash Khanal
Srikumar Sastry
Hannah Kerner
Nathan Jacobs
71
13
0
29 Jul 2023