ResearchTrend.AI
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
    FaML
ArXiv (abs) | PDF | HTML | Github (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation
Yash J. Patel
Yusheng Xie
Yi Zhu
Srikar Appalaraju
R. Manmatha
75
4
0
07 Feb 2023
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Kuniaki Saito
Kihyuk Sohn
Xiang Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
124
123
0
06 Feb 2023
Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
Zekun Qi
Runpei Dong
Guo Fan
Zheng Ge
Xiangyu Zhang
Kaisheng Ma
Li Yi
154
130
0
05 Feb 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Fan Liu
Liqiang Nie
Mohan S. Kankanhalli
79
10
0
04 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM VLM MoE
116
171
0
01 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
101
32
0
01 Feb 2023
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
Pengzhen Ren
Changlin Li
Hang Xu
Yi Zhu
Guangrun Wang
Jian-zhuo Liu
Xiaojun Chang
Xiaodan Liang
106
45
0
31 Jan 2023
Understanding Self-Distillation in the Presence of Label Noise
Rudrajit Das
Sujay Sanghavi
137
17
0
30 Jan 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM MLLM
447
4,668
0
30 Jan 2023
OvarNet: Towards Open-vocabulary Object Attribute Recognition
Keyan Chen
Xiaolong Jiang
Yao Hu
Xu Tang
Yan Gao
Jianqi Chen
Weidi Xie
VLM ObjD
76
41
0
23 Jan 2023
Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision
Jilan Xu
Junlin Hou
Yuejie Zhang
Rui Feng
Yi Wang
Yu Qiao
Weidi Xie
VLM
82
87
0
22 Jan 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Floris Weers
Vaishaal Shankar
Angelos Katharopoulos
Yinfei Yang
Tom Gunter
CLIP
54
5
0
19 Jan 2023
CLIPTER: Looking at the Bigger Picture in Scene Text Recognition
Aviad Aberdam
David Bensaid
Alona Golts
Roy Ganz
Oren Nuriel
Royee Tichauer
Shai Mazor
Ron Litman
VLM CLIP
90
13
0
18 Jan 2023
Towards Models that Can See and Read
Roy Ganz
Oren Nuriel
Aviad Aberdam
Yair Kittenplon
Shai Mazor
Ron Litman
71
13
0
18 Jan 2023
GLIGEN: Open-Set Grounded Text-to-Image Generation
Yuheng Li
Haotian Liu
Qingyang Wu
Fangzhou Mu
Jianwei Yang
Jianfeng Gao
Chunyuan Li
Yong Jae Lee
VLM
148
603
1
17 Jan 2023
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval
Yan Zhang
Zhong Ji
Dingrong Wang
Yanwei Pang
Xuelong Li
VLM
61
23
0
17 Jan 2023
UATVR: Uncertainty-Adaptive Text-Video Retrieval
Bo Fang
Wenhao Wu
Chang-rui Liu
Yu Zhou
Yuxin Song
Weiping Wang
Min Yang
Xiang Ji
Jingdong Wang
107
57
0
16 Jan 2023
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Mariya Hendriksen
Svitlana Vakulenko
E. Kuiper
Maarten de Rijke
89
4
0
12 Jan 2023
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
Xinsong Zhang
Yan Zeng
Jipeng Zhang
Hang Li
VLM AI4CE LRM
111
17
0
12 Jan 2023
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
Manh-Duy Nguyen
Binh T. Nguyen
C. Gurrin
VLM
47
5
0
11 Jan 2023
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
Shruthi Bannur
Stephanie L. Hyland
Qianchu Liu
Fernando Pérez-García
Maximilian Ilse
...
Maria T. A. Wetscherek
M. Lungren
A. Nori
Javier Alvarez-Valle
Ozan Oktay
87
127
0
11 Jan 2023
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Filip Radenovic
Abhimanyu Dubey
Abhishek Kadian
Todor Mihaylov
Simon Vandenhende
Yash J. Patel
Y. Wen
Vignesh Ramanathan
D. Mahajan
VLM
89
86
0
05 Jan 2023
MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology
Chaoyi Wu
Xiaoman Zhang
Ya Zhang
Yanfeng Wang
Weidi Xie
LM&MA VLM
118
120
0
05 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
80
7
0
05 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
88
15
0
05 Jan 2023
Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples
Jiaming Zhang
Xingjun Ma
Qiaomin Yi
Jitao Sang
Yugang Jiang
Yaowei Wang
Changsheng Xu
93
26
0
31 Dec 2022
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM AI4TS
220
75
0
30 Dec 2022
Learning Multimodal Data Augmentation in Feature Space
Zichang Liu
Zhiqiang Tang
Xingjian Shi
Aston Zhang
Mu Li
Anshumali Shrivastava
A. Wilson
98
23
0
29 Dec 2022
BagFormer: Better Cross-Modal Retrieval via bag-wise interaction
Haowen Hou
Xiaopeng Yan
Yigeng Zhang
Fengzong Lian
Zhanhui Kang
BDL
34
0
0
29 Dec 2022
When are Lemons Purple? The Concept Association Bias of Vision-Language Models
Yutaro Yamada
Yingtian Tang
Yoyo Zhang
Ilker Yildirim
CoGe
64
15
0
22 Dec 2022
Multi-queue Momentum Contrast for Microvideo-Product Retrieval
Yali Du
Yin-wei Wei
Wei Ji
Fan Liu
Xin Luo
Liqiang Nie
94
16
0
22 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM MLLM ObjD
124
259
0
21 Dec 2022
ALCAP: Alignment-Augmented Music Captioner
Zihao He
Weituo Hao
Weiyi Lu
Changyou Chen
Kristina Lerman
Xuchen Song
76
1
0
21 Dec 2022
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Jiaxian Guo
Junnan Li
Dongxu Li
A. M. H. Tiong
Boyang Albert Li
Dacheng Tao
Steven C. H. Hoi
VLM MLLM
79
118
0
21 Dec 2022
UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering
Chenlu Zhan
Peng Peng
Hongsen Wang
Tao Chen
Hongwei Wang
MedIm
77
4
0
21 Dec 2022
MetaCLUE: Towards Comprehensive Visual Metaphors Research
Arjun Reddy Akula
Brenda S. Driscoll
P. Narayana
Soravit Changpinyo
Zhi-xuan Jia
...
Sugato Basu
Leonidas Guibas
William T. Freeman
Yuanzhen Li
Varun Jampani
CLIP VLM
56
26
0
19 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
70
38
0
19 Dec 2022
Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding
Haoli Bai
Zhiguang Liu
Xiaojun Meng
Wentao Li
Shuangning Liu
...
Liangwei Wang
Lu Hou
Jiansheng Wei
Xin Jiang
Qun Liu
ViT
77
13
0
19 Dec 2022
Attentive Mask CLIP
Yifan Yang
Weiquan Huang
Yixuan Wei
Houwen Peng
Xinyang Jiang
...
Fangyun Wei
Yin Wang
Han Hu
Lili Qiu
Yuqing Yang
CLIP VLM
83
27
0
16 Dec 2022
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Letitia Parcalabescu
Anette Frank
88
28
0
15 Dec 2022
Visually-augmented pretrained language models for NLP tasks without images
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Qinyu Zhang
Ji-Rong Wen
VLM
56
10
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
109
30
0
14 Dec 2022
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Zixian Ma
Jerry Hong
Mustafa Omer Gul
Mona Gandhi
Irena Gao
Ranjay Krishna
CoGe
94
142
0
13 Dec 2022
Uniform Masking Prevails in Vision-Language Pretraining
Siddharth Verma
Yuchen Lu
Rui Hou
Hanchao Yu
Nicolas Ballas
Madian Khabsa
Amjad Almahairi
VLM
50
0
0
10 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
125
81
0
09 Dec 2022
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
Jishnu Mukhoti
Tsung-Yu Lin
Omid Poursaeed
Rui Wang
Ashish Shah
Philip Torr
Ser-Nam Lim
VLM
135
83
0
09 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP CoGe
95
9
0
08 Dec 2022
Graph Matching with Bi-level Noisy Correspondence
Yijie Lin
Mouxing Yang
Jun Yu
Peng Hu
Changqing Zhang
Xiaocui Peng
112
33
0
08 Dec 2022
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
Zhongwei Wan
Yichun Yin
Wei Zhang
Jiaxin Shi
Lifeng Shang
Guangyong Chen
Xin Jiang
Qun Liu
VLM CLL
125
18
0
07 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
88
27
0
07 Dec 2022