2107.07651
Cited By
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
FaML
Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,195 papers shown
Title
Image-text matching for large-scale book collections
Artemis Llabrés
Arka Ujjal Dey
Dimosthenis Karatzas
Ernest Valveny
18
0
0
29 Jul 2024
WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting
Jingjing Wu
Zhengyao Fang
Pengyuan Lyu
Chengquan Zhang
Fanglin Chen
Guangming Lu
Wenjie Pei
50
2
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
37
0
0
28 Jul 2024
Data Processing Techniques for Modern Multimodal Models
Yinheng Li
Han Ding
Hang Chen
VLM
31
0
0
27 Jul 2024
Diffusion Models for Multi-Task Generative Modeling
Changyou Chen
Han Ding
Bunyamin Sisman
Yi Tian Xu
Ouye Xie
Benjamin Z. Yao
Son Dinh Tran
Belinda Zeng
DiffM
45
4
0
24 Jul 2024
Selective Vision-Language Subspace Projection for Few-shot CLIP
Xingyu Zhu
Beier Zhu
Yi Tan
Shuo Wang
Yanbin Hao
Hanwang Zhang
VLM
46
3
0
24 Jul 2024
LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies
Jia Shi
Gautam Gare
Jinjin Tian
Siqi Chai
Zhiqiu Lin
Arun Vasudevan
Di Feng
Francesco Ferroni
Shu Kong
VLM
OODD
OOD
55
3
0
22 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
57
23
0
22 Jul 2024
Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders
Laura Niss
Kevin Vogt-Lowell
Theodoros Tsiligkaridis
VLM
36
0
0
22 Jul 2024
Spatial-Temporal Cross-View Contrastive Pre-training for Check-in Sequence Representation Learning
Letian Gong
Huaiyu Wan
S. Guo
Xiucheng Li
Yan Lin
Erwen Zheng
Tianyi Wang
Zeyu Zhou
Youfang Lin
AI4TS
51
1
0
22 Jul 2024
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective
Mariya Hendriksen
Shuo Zhang
R. Reinanda
Mohamed Yahya
Edgar Meij
Maarten de Rijke
54
0
0
21 Jul 2024
Rethinking Domain Adaptation and Generalization in the Era of CLIP
Ruoyu Feng
Tao Yu
Xin Jin
Xiaoyuan Yu
Lei Xiao
Zhibo Chen
VLM
34
1
0
21 Jul 2024
HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation
Zezeng Li
Weimin Wang
WenHai Li
Na Lei
Xianfeng Gu
OT
DiffM
33
0
0
19 Jul 2024
Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2
Chun Xu
En-Wei Sun
36
0
0
19 Jul 2024
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
Xiaoyu Zhu
Hao Zhou
Pengfei Xing
Long Zhao
Hao Xu
Junwei Liang
Alex Hauptmann
Ting Liu
Andrew C. Gallagher
DiffM
62
4
0
18 Jul 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
21
0
0
18 Jul 2024
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan
Chaofeng Chen
Yiping Ke
Xinjiang Wang
Xue Jiang
Wayne Zhang
VLM
47
24
0
17 Jul 2024
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Naoya Sogi
Takashi Shibata
Makoto Terao
VLM
35
1
0
17 Jul 2024
Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge
Kang Shen
Xuxiong Liu
Boyan Wang
Jun Yao
Xin Liu
Yujie Guan
Yu Wang
Gengchen Li
Xiao Sun
CVBM
41
2
0
17 Jul 2024
Cross-Modal Augmentation for Few-Shot Multimodal Fake News Detection
Ye Jiang
Taihang Wang
Xiaoman Xu
Yimin Wang
Xingyi Song
Diana Maynard
36
2
0
16 Jul 2024
PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation
Renjie Lu
Jingke Meng
Wei-Shi Zheng
36
3
0
16 Jul 2024
How and where does CLIP process negation?
Vincent Quantmeyer
Pablo Mosteiro
Albert Gatt
CoGe
29
6
0
15 Jul 2024
Open Vocabulary Multi-Label Video Classification
Rohit Gupta
Mamshad Nayeem Rizve
Jayakrishnan Unnikrishnan
Ashish Tawari
Son Tran
Mubarak Shah
Benjamin Z. Yao
Trishul Chilimbi
VLM
67
1
0
12 Jul 2024
15M Multimodal Facial Image-Text Dataset
Dawei Dai
Yutang Li
Yingge Liu
Mingming Jia
Zhang YuanHui
Guoyin Wang
VLM
31
7
0
11 Jul 2024
Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement
Zijie Yue
Miaojing Shi
Hanli Wang
Shuai Ding
Qijun Chen
Shanlin Yang
42
0
0
11 Jul 2024
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data
Siyi Du
Shaoming Zheng
Yinsong Wang
Wenjia Bai
D. O’Regan
Chen Qin
LMTD
36
4
0
10 Jul 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Zhongang Qi
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
38
2
0
10 Jul 2024
Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition
Daiqing Wu
Dongbao Yang
Huawen Shen
Can Ma
Yu Zhou
45
4
0
09 Jul 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
66
4
0
09 Jul 2024
Learning to Adapt Category Consistent Meta-Feature of CLIP for Few-Shot Classification
Jiaying Shi
Xuetong Xue
Shenghui Xu
VLM
37
0
0
08 Jul 2024
Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition
Zirun Guo
Tao Jin
Zhou Zhao
29
9
0
07 Jul 2024
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen
Yichao Du
Zichen Wen
Yiyang Zhou
Chenhang Cui
...
Jiawei Zhou
Zhuokai Zhao
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM
MLLM
58
29
0
05 Jul 2024
MARS: Paying more attention to visual attributes for text-based person search
Alex Ergasti
Tomaso Fontanini
Claudio Ferrari
Massimo Bertozzi
Andrea Prati
57
9
0
05 Jul 2024
Visual Grounding with Attention-Driven Constraint Balancing
Weitai Kang
Luowei Zhou
Junyi Wu
Changchang Sun
Yan Yan
45
4
0
03 Jul 2024
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
47
2
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
41
9
0
01 Jul 2024
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
Yuxuan Wang
Yijun Liu
Fei Yu
Chen Huang
Kexin Li
Zhiguo Wan
Wanxiang Che
VLM
CoGe
35
5
0
01 Jul 2024
From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
Nan Xu
Fei Wang
Sheng Zhang
Hoifung Poon
Muhao Chen
34
6
0
01 Jul 2024
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
Mehar Bhatia
Sahithya Ravi
Aditya Chinchure
EunJeong Hwang
Vered Shwartz
VLM
32
2
0
28 Jun 2024
Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train
Haojun Jiang
Meng Li
Zhenguo Sun
Ning Jia
Yu Sun
Shaqi Luo
Shiji Song
Gao Huang
49
2
0
28 Jun 2024
Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
Malvina Nikandrou
Georgios Pantazopoulos
Ioannis Konstas
Alessandro Suglia
32
0
0
27 Jun 2024
Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models
Yicheng Xu
Yuxin Chen
Jiahao Nie
Yusong Wang
Huiping Zhuang
Manabu Okumura
VLM
CLL
46
6
0
27 Jun 2024
Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
H. Kerdegari
Kyle Higgins
Dennis Veselkov
I. Laponogov
I. Poļaka
...
Junior Andrea Pescino
M. Leja
M. Dinis-Ribeiro
T. F. Kanonnikoff
Kirill Veselkov
35
3
0
26 Jun 2024
MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation
Liuyi Wang
Zongtao He
Mengjiao Shen
Jingwei Yang
Chengju Liu
Qijun Chen
VLM
33
2
0
25 Jun 2024
DN-CL: Deep Symbolic Regression against Noise via Contrastive Learning
Jingyi Liu
Yanjie Li
Lina Yu
Min Wu
Weijun Li
Wenqiang Li
Meilan Hao
Yusong Deng
Shu Wei
49
0
0
21 Jun 2024
Revealing Vision-Language Integration in the Brain with Multimodal Networks
Vighnesh Subramaniam
C. Conwell
Christopher Wang
Gabriel Kreiman
Boris Katz
Ignacio Cases
Andrei Barbu
35
8
0
20 Jun 2024
LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation
Rebecca Salganik
Xiaohao Liu
Yunshan Ma
Jian Kang
Tat-Seng Chua
CLL
46
2
0
20 Jun 2024
Towards a multimodal framework for remote sensing image change retrieval and captioning
Roger Ferrod
Luigi Di Caro
Dino Ienco
24
2
0
19 Jun 2024
Synergizing Foundation Models and Federated Learning: A Survey
Shenghui Li
Fanghua Ye
Meng Fang
Jiaxu Zhao
Yun-Hin Chan
Edith C. -H. Ngai
Thiemo Voigt
AI4CE
57
5
0
18 Jun 2024
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
CoGe
40
9
0
17 Jun 2024