2107.07651
Cited By
v1
v2 (latest)
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
Github (1658★)
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,231 papers shown
Title
Transformer-empowered Multi-modal Item Embedding for Enhanced Image Search in E-Commerce
Chang Liu
Peng Hou
Anxiang Zeng
Hanwen Yu
127
2
0
29 Nov 2023
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei
Yang Chen
Haonan Chen
Hexiang Hu
Ge Zhang
Jie Fu
Alan Ritter
Wenhu Chen
88
70
0
28 Nov 2023
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
Jiayun Luo
Siddhesh Khandelwal
Leonid Sigal
Boyang Albert Li
MLLM
VLM
136
8
0
28 Nov 2023
RISAM: Referring Image Segmentation via Mutual-Aware Attention Features
Mengxi Zhang
Yiming Liu
Xiangjun Yin
Huanjing Yue
Jingyu Yang
114
1
0
27 Nov 2023
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
Lingchen Meng
Shiyi Lan
Hengduo Li
Jose M. Alvarez
Zuxuan Wu
Yu-Gang Jiang
VLM
ISeg
MLLM
67
9
0
24 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
81
12
0
22 Nov 2023
Multimodal Large Language Models: A Survey
Jiayang Wu
Wensheng Gan
Zefeng Chen
Shicheng Wan
Philip S. Yu
95
195
0
22 Nov 2023
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Meng Chu
Zhedong Zheng
Wei Ji
Tingyu Wang
Tat-Seng Chua
91
10
0
21 Nov 2023
From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation
Jiaxin Ge
Sanjay Subramanian
Trevor Darrell
Boyi Li
LRM
104
4
0
21 Nov 2023
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
66
14
0
18 Nov 2023
MultiDelete for Multimodal Machine Unlearning
Jiali Cheng
Hadi Amiri
MU
111
9
0
18 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
131
72
0
16 Nov 2023
RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection
Stefanos-Iordanis Papadopoulos
C. Koutlis
Symeon Papadopoulos
P. Petrantonakis
79
10
0
16 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
371
711
0
16 Nov 2023
Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval
Junyang Chen
Hanjiang Lai
VLM
130
15
0
13 Nov 2023
Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples
Ziye Fang
Xin Jiang
Hao Tang
Zechao Li
92
14
0
10 Nov 2023
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter
Georgios Tziafas
Yucheng Xu
Arushi Goel
Mohammadreza Kasaei
Zhibin Li
Hamidreza Kasaei
89
27
0
09 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
141
21
0
09 Nov 2023
Zero-shot Translation of Attention Patterns in VQA Models to Natural Language
Leonard Salewski
A. Sophia Koepke
Hendrik P. A. Lensch
Zeynep Akata
71
2
0
08 Nov 2023
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Georgios Pantazopoulos
Malvina Nikandrou
Amit Parekh
Bhathiya Hemanthage
Arash Eshghi
Ioannis Konstas
Verena Rieser
Oliver Lemon
Alessandro Suglia
LM&Ro
77
7
0
07 Nov 2023
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
U. Sahin
Hang Li
Qadeer Ahmad Khan
Daniel Cremers
Volker Tresp
VLM
CoGe
74
14
0
07 Nov 2023
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
Cheng Cheng
Lin Song
Ruoyi Xue
Hang Wang
Hongbin Sun
Yixiao Ge
Ying Shan
VLM
ObjD
116
26
0
07 Nov 2023
Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers
Hai T. Phan
Cindy X. Le
Vu Le
Yihui He
Anh Totti Nguyen
63
3
0
06 Nov 2023
Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning
Yiran Li
Junpeng Wang
Prince Osei Aboagye
Michael Yeh
Yan Zheng
Liang Wang
Wei Zhang
Kwan-Liu Ma
75
3
0
02 Nov 2023
Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning
Zhenyu Zhang
Benlu Wang
Weijie Liang
Yizhi Li
Xuechen Guo
Guanhong Wang
Shiyan Li
Gaoang Wang
MedIm
LM&MA
34
9
0
02 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM
DiffM
103
11
0
01 Nov 2023
CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
A. Fuller
K. Millard
James R. Green
94
72
0
01 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
151
44
0
01 Nov 2023
Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data
Antonis Antoniades
Yiyi Yu
Joseph Canzano
William Wang
Spencer L. Smith
AI4CE
120
13
0
31 Oct 2023
Class Incremental Learning with Pre-trained Vision-Language Models
Xialei Liu
Xusheng Cao
Haori Lu
Jia-Wen Xiao
Andrew D. Bagdanov
Ming-Ming Cheng
VLM
87
12
0
31 Oct 2023
SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization
Hao Dong
Ismail Nejjar
Han Sun
Eleni Chatzi
Olga Fink
99
25
0
30 Oct 2023
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
Youbo Lei
Feifei He
Chen Chen
Yingbin Mo
Sijia Li
Defeng Xie
H. Lu
VLM
87
0
0
30 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
106
2
0
30 Oct 2023
ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense
Kankan Zhou
Eason Lai
Wei Bin Au Yeong
K. Mouratidis
Jing Jiang
ReLM
LRM
VLM
67
20
0
30 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
97
34
0
29 Oct 2023
Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad
Reza Azad
Sania Eskandari
Afshin Bozorgpour
Amirhossein Kazerouni
I. Rekik
Dorit Merhof
VLM
MedIm
144
68
0
28 Oct 2023
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
Yuchen Suo
Linchao Zhu
Yi Yang
86
13
0
27 Oct 2023
SynergyNet: Bridging the Gap between Discrete and Continuous Representations for Precise Medical Image Segmentation
Vandan Gorade
Sparsh Mittal
Debesh Jha
Ulas Bagci
69
11
0
26 Oct 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
74
7
0
26 Oct 2023
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan
B. Vijaykumar
S. Schulter
Manmohan Chandraker
Yun Fu
ReLM
62
12
0
25 Oct 2023
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
81
2
0
24 Oct 2023
I²MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation
Yunyao Mao
Jiajun Deng
Wen-gang Zhou
Zhenbo Lu
Wanli Ouyang
Houqiang Li
VLM
85
1
0
24 Oct 2023
Learning with Noisy Labels Using Collaborative Sample Selection and Contrastive Semi-Supervised Learning
Qing Miao
Xiaohe Wu
Chao Xu
Yanli Ji
Wangmeng Zuo
Yiwen Guo
Zhaopeng Meng
NoLa
85
5
0
24 Oct 2023
Linking Surface Facts to Large-Scale Knowledge Graphs
Gorjan Radevski
Kiril Gashteovski
Chia-Chien Hung
Carolin (Haas) Lawrence
Goran Glavaš
HILM
58
3
0
23 Oct 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang
VLM
87
35
0
23 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
82
4
0
20 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
61
0
0
20 Oct 2023
SILC: Improving Vision Language Pretraining with Self-Distillation
Muhammad Ferjad Naeem
Yongqin Xian
Xiaohua Zhai
Lukas Hoyer
Luc Van Gool
F. Tombari
VLM
110
36
0
20 Oct 2023
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
Yimu Wang
Xiangru Jian
Bo Xue
55
11
0
17 Oct 2023
EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset
Hang Yin
Pinren Lu
Ziang Li
Bin Sun
Kan Li
96
0
0
17 Oct 2023