ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2107.07651
  4. Cited By
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation
v1v2 (latest)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
    FaML
ArXiv (abs)PDFHTMLGithub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
Title
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video
  Classification
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
87
5
0
08 Jan 2024
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for
  Audio-Video Classification
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
60
4
0
08 Jan 2024
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes
  Interactively
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Haobo Yuan
Xiangtai Li
Chong Zhou
Yining Li
Kai Chen
Chen Change Loy
VLM
116
51
0
05 Jan 2024
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and
  Highlight Detection
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection
Hao Sun
Mingyao Zhou
Wenjing Chen
Wei Xie
PINN3DGSViT
63
38
0
04 Jan 2024
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for
  Multimodal Alignment
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma
Furong Xu
Jian Liu
Ming Yang
Qingpei Guo
VLM
79
3
0
04 Jan 2024
Social Media Ready Caption Generation for Brands
Social Media Ready Caption Generation for Brands
Himanshu Maheshwari
Koustava Goswami
Apoorv Saxena
Balaji Vasan Srinivasan
49
1
0
03 Jan 2024
Enhancing Representation in Medical Vision-Language Foundation Models
  via Multi-Scale Information Extraction Techniques
Enhancing Representation in Medical Vision-Language Foundation Models via Multi-Scale Information Extraction Techniques
Weijian Huang
Cheng Li
Hong-Yu Zhou
Jiarun Liu
Hao Yang
Yong Liang
Guangming Shi
Hairong Zheng
Shanshan Wang
65
2
0
03 Jan 2024
Incorporating Geo-Diverse Knowledge into Prompting for Increased
  Geographical Robustness in Object Recognition
Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
Kyle Buettner
Sina Malakouti
Xiang Lorraine Li
Adriana Kovashka
128
3
0
03 Jan 2024
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label
  Classification
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
Xueling Zhu
Jian Liu
Dongqi Tang
Jiawei Ge
Weijia Liu
Bo Liu
Jiuxin Cao
VLM
61
1
0
02 Jan 2024
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
Qiuhui Chen
Yi Hong
MedIm
120
2
0
02 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
  Pre-Training
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
Kevin Qinghong Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLMVGen
99
12
0
01 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object
  Detectors
Generating Enhanced Negatives for Training Language-Based Object Detectors
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjDVLM
112
6
0
29 Dec 2023
P2M2-Net: Part-Aware Prompt-Guided Multimodal Point Cloud Completion
P2M2-Net: Part-Aware Prompt-Guided Multimodal Point Cloud Completion
Linlian Jiang
Pan Chen
Ye Wang
Tieru Wu
Rui Ma
3DPC
70
0
0
29 Dec 2023
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation
Zhuohang Dang
Minnan Luo
Chengyou Jia
Guangwen Dai
Xiao Chang
Jingdong Wang
73
8
0
27 Dec 2023
GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal
  Machine Learning Combining Facial Images and Clinical Texts
GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts
Da Wu
Jing Yang
Cong Liu
Tzung-Chien Hsieh
E. Marchi
...
Wendy K. Chung
G. Lyon
Ian D. Krantz
J. Kalish
Kai Wang
56
2
0
23 Dec 2023
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language
  Pre-training and Open-Vocabulary Object Detection
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Haozhan Shen
Tiancheng Zhao
Mingwei Zhu
Yuxiang Cai
VLMObjD
175
11
0
22 Dec 2023
Parrot Captions Teach CLIP to Spot Text
Parrot Captions Teach CLIP to Spot Text
Yiqi Lin
Conghui He
Alex Jinpeng Wang
Bin Wang
Weijia Li
Mike Zheng Shou
102
7
0
21 Dec 2023
MetaSegNet: Metadata-collaborative Vision-Language Representation
  Learning for Semantic Segmentation of Remote Sensing Images
MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images
Libo Wang
Sijun Dong
Ying Chen
Xiaoliang Meng
Shenghui Fang
Ayman Habib
Songlin Fei
42
5
0
20 Dec 2023
ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training
ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training
Rongsheng Wang
Qingsong Yao
Zihang Jiang
Zhiyang He
Xiaodong Tao
Zihang Jiang
S.Kevin Zhou
MedImVLM
110
6
0
20 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in
  Language-Image Pretraining
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
94
8
0
19 Dec 2023
Expediting Contrastive Language-Image Pretraining via Self-distilled
  Encoders
Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
Bumsoo Kim
Jinhyung Kim
Yeonsik Jo
S. Kim
VLM
98
4
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLMMLLM
167
36
0
19 Dec 2023
UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray
  Classification
UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification
Tianjie Dai
Ruipeng Zhang
Feng Hong
Jiangchao Yao
Ya Zhang
Yanfeng Wang
114
13
0
18 Dec 2023
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu
Xinyun Zhang
Peng Xu
Peiyu Liao
Xufeng Yao
Bei Yu
VLM
37
0
0
17 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
M. Volkovs
125
3
0
15 Dec 2023
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards
  Universal Interpretation for Earth Observation Imagery
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo
Jiangwei Lao
Bo Dang
Yingying Zhang
Lei Yu
...
Jian Wang
Jingdong Chen
Ming Yang
Yongjun Zhang
Yansheng Li
152
129
0
15 Dec 2023
Text-Guided Face Recognition using Multi-Granularity Cross-Modal
  Contrastive Learning
Text-Guided Face Recognition using Multi-Granularity Cross-Modal Contrastive Learning
Md Golam Moula Mehedi Hasan
S. Sami
Nasser M. Nasrabadi
65
6
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language
  Pre-training
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIPVLM
57
4
0
14 Dec 2023
Planning and Rendering: Towards End-to-End Product Poster Generation
Planning and Rendering: Towards End-to-End Product Poster Generation
Zhaochen Li
Fengheng Li
Wei Feng
Honghe Zhu
An Liu
...
Xin Zhu
Jun-Jun Shen
Zhangang Lin
Jingping Shao
Zhenglu Yang
DiffM
63
2
0
14 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
101
15
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human
  Pathology
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedImLM&MA
86
23
0
13 Dec 2023
Cross-modal Contrastive Learning with Asymmetric Co-attention Network
  for Video Moment Retrieval
Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval
Love Panta
Prashant Shrestha
Brabeem Sapkota
Amrita Bhattarai
Suresh Manandhar
Anand Kumar Sah
92
5
0
12 Dec 2023
Hallucination Augmented Contrastive Learning for Multimodal Large
  Language Model
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Chaoya Jiang
Haiyang Xu
Mengfan Dong
Jiaxing Chen
Wei Ye
Mingshi Yan
Qinghao Ye
Ji Zhang
Fei Huang
Shikun Zhang
VLM
65
61
0
12 Dec 2023
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
Jiashuo Fan
Yaoyuan Liang
Leyao Liu
Shao-Lun Huang
Lei Zhang
119
2
0
11 Dec 2023
Dynamic Weighted Combiner for Mixed-Modal Image Retrieval
Dynamic Weighted Combiner for Mixed-Modal Image Retrieval
Fuxiang Huang
Lei Zhang
Xiaowei Fu
Suqi Song
92
12
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
81
6
0
11 Dec 2023
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting
  Multi-modal Language Models
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models
Shitian Zhao
Zhuowan Li
Yadong Lu
Alan Yuille
Yan Wang
LRM
75
9
0
09 Dec 2023
SA-Attack: Improving Adversarial Transferability of Vision-Language
  Pre-training Models via Self-Augmentation
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation
Bangyan He
Xiaojun Jia
Siyuan Liang
Tianrui Lou
Yang Liu
Xiaochun Cao
AAMLVLM
107
29
0
08 Dec 2023
Cross-BERT for Point Cloud Pretraining
Cross-BERT for Point Cloud Pretraining
Xin Li
Peng Li
Zeyong Wei
Zhe Zhu
Mingqiang Wei
Junhui Hou
Liangliang Nan
J. Qin
H. Xie
F. Wang
SSL3DPC
77
0
0
08 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLMObjDLRMFAtt
93
12
0
07 Dec 2023
OT-Attack: Enhancing Adversarial Transferability of Vision-Language
  Models via Optimal Transport Optimization
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization
Dongchen Han
Xiaojun Jia
Yang Bai
Jindong Gu
Yang Liu
Xiaochun Cao
VLM
86
26
0
07 Dec 2023
Bootstrapping SparseFormers from Vision Foundation Models
Bootstrapping SparseFormers from Vision Foundation Models
Ziteng Gao
Zhan Tong
Kevin Qinghong Lin
Joya Chen
Mike Zheng Shou
52
0
0
04 Dec 2023
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language
  Models
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
Xunguang Wang
Zhenlan Ji
Pingchuan Ma
Zongjie Li
Shuai Wang
MLLM
96
14
0
04 Dec 2023
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
Dixuan Lin
Yi-Xing Peng
Jingke Meng
Wei-Shi Zheng
84
6
0
04 Dec 2023
LightCLIP: Learning Multi-Level Interaction for Lightweight
  Vision-Language Models
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models
Ying Nie
Wei He
Kai Han
Yehui Tang
Tianyu Guo
Fanyi Du
Yunhe Wang
VLM
86
4
0
01 Dec 2023
RTQ: Rethinking Video-language Understanding Based on Image-text Model
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Xiao Wang
Yaoyu Li
Tian Gan
Zheng Zhang
Jingjing Lv
Liqiang Nie
103
8
0
01 Dec 2023
CAST: Cross-Attention in Space and Time for Video Action Recognition
CAST: Cross-Attention in Space and Time for Video Action Recognition
Dongho Lee
Jongseo Lee
Jinwoo Choi
EgoV
110
13
0
30 Nov 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
  representations to LLMs and Emergent Cross-modal Reasoning
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLMMLLM
151
61
0
30 Nov 2023
TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing
TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing
Lianrui Mu
Jianhong Bai
Xiaoxuan He
Jiangnan Ye
Xiaoyu Liang
Yuchen Yang
Jiedong Zhuang
Haoji Hu
94
2
0
30 Nov 2023
GELDA: A generative language annotation framework to reveal visual
  biases in datasets
GELDA: A generative language annotation framework to reveal visual biases in datasets
Krish Kabra
Kathleen M. Lewis
Guha Balakrishnan
VLM
44
1
0
29 Nov 2023
Previous
123...101112...232425
Next