ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2107.07651
  4. Cited By
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Chenyu You
Caiming Xiong
Guosheng Lin
    FaML
ArXivPDFHTML

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,195 papers shown
Title
Efficient Vision-Language Pretraining with Visual Concepts and
  Hierarchical Alignment
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
24
27
0
29 Aug 2022
CMD: Self-supervised 3D Action Representation Learning with Cross-modal
  Mutual Distillation
CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation
Yunyao Mao
Wen-gang Zhou
Zhenbo Lu
Jiajun Deng
Houqiang Li
30
38
0
26 Aug 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
  Pretraining
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
54
158
0
25 Aug 2022
MuMUR : Multilingual Multimodal Universal Retrieval
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
44
3
0
24 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
54
629
0
22 Aug 2022
Revising Image-Text Retrieval via Multi-Modal Entailment
Revising Image-Text Retrieval via Multi-Modal Entailment
Xu Yan
Chunhui Ai
Ziqiang Cao
Min Cao
Sujian Li
Wen-Yi Chen
Guohong Fu
28
0
0
22 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
34
11
0
19 Aug 2022
See Finer, See More: Implicit Modality Alignment for Text-based Person
  Retrieval
See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval
Xiujun Shu
Wei Wen
Haoqian Wu
Keyun Chen
Yi-Zhe Song
Ruizhi Qiao
Bohan Ren
Xiao Wang
27
91
0
18 Aug 2022
Multimodal foundation models are better simulators of the human brain
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
37
13
0
17 Aug 2022
ConTextual Masked Auto-Encoder for Dense Passage Retrieval
ConTextual Masked Auto-Encoder for Dense Passage Retrieval
Xing Wu
Guangyuan Ma
Meng Lin
Zijia Lin
Zhongyuan Wang
Songlin Hu
RALM
27
24
0
16 Aug 2022
Self-supervised Multi-modal Training from Uncurated Image and Reports
  Enables Zero-shot Oversight Artificial Intelligence in Radiology
Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology
Sangjoon Park
Eunha Lee
Kyung Sook Shin
Jeonghyeon Lee
Jong Chul Ye
33
2
0
10 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language
  Pre-training
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
23
11
0
08 Aug 2022
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
  for Multi-Modal Understanding
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang
Feiya Lv
Ting Yao
Yiming Yuan
Jin Ma
Yu Luo
Haijin Liang
31
3
0
05 Aug 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li
Xin He
Longhui Wei
Long Qian
Linchao Zhu
Lingxi Xie
Yueting Zhuang
Qi Tian
Siliang Tang
VLM
41
79
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation
  Learning
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
36
67
0
03 Aug 2022
Integrating Object-aware and Interaction-aware Knowledge for Weakly
  Supervised Scene Graph Generation
Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation
Xingchen Li
Long Chen
Wenbo Ma
Yi Yang
Jun Xiao
21
26
0
03 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual
  Semantics
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
25
1
0
31 Jul 2022
V$^2$L: Leveraging Vision and Vision-language Models into Large-scale
  Product Retrieval
V2^22L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval
Wenhao Wang
Yifan Sun
Zongxin Yang
Yi Yang
VLM
24
3
0
26 Jul 2022
A Priority Map for Vision-and-Language Navigation with Trajectory Plans
  and Feature-Location Cues
A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues
Jason Armitage
L. Impett
Rico Sennrich
33
5
0
24 Jul 2022
Semantic Abstraction: Open-World 3D Scene Understanding from 2D
  Vision-Language Models
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
Huy Ha
Shuran Song
LM&Ro
VLM
43
102
0
23 Jul 2022
Exploiting Unlabeled Data with Vision and Language Models for Object
  Detection
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Shiyu Zhao
Zhixing Zhang
S. Schulter
Long Zhao
Vijay Kumar B.G
Anastasis Stathopoulos
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
27
100
0
18 Jul 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
Xiaoping Han
Licheng Yu
Xiatian Zhu
Li Zhang
Yi-Zhe Song
Tao Xiang
AI4TS
16
49
0
17 Jul 2022
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Jingjia Huang
Yinan Li
Jiashi Feng
Xinglong Wu
Xiaoshuai Sun
Rongrong Ji
VLM
24
48
0
16 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
  Retrieval
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
34
271
0
15 Jul 2022
Video Graph Transformer for Video Question Answering
Video Graph Transformer for Video Question Answering
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
159
75
0
12 Jul 2022
IDEA: Increasing Text Diversity via Online Multi-Label Recognition for
  Vision-Language Pre-training
IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training
Xinyu Huang
Youcai Zhang
Ying Cheng
Weiwei Tian
Ruiwei Zhao
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Xuanyang Zhang
VLM
21
14
0
12 Jul 2022
Vision-and-Language Pretraining
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
A. Luu
VLM
CLIP
27
2
0
05 Jul 2022
Counterfactually Measuring and Eliminating Social Bias in
  Vision-Language Pre-training Models
Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models
Yi Zhang
Junyan Wang
Jitao Sang
22
27
0
03 Jul 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with
  Objects, Attributes and Relations
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Tiancheng Zhao
Tianqi Zhang
Mingwei Zhu
Haozhan Shen
Kyusong Lee
Xiaopeng Lu
Jianwei Yin
VLM
CoGe
MLLM
45
91
0
01 Jul 2022
Improving Visual Grounding by Encouraging Consistent Gradient-based
  Explanations
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Ziyan Yang
Kushal Kafle
Franck Dernoncourt
Vicente Ordónez Román
VLM
28
19
0
30 Jun 2022
Towards Adversarial Attack on Vision-Language Pre-training Models
Towards Adversarial Attack on Vision-Language Pre-training Models
Jiaming Zhang
Qiaomin Yi
Jitao Sang
VLM
AAML
27
94
0
19 Jun 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
34
42
0
17 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
71
393
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language
  Representation Learning
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
51
64
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
25
83
0
16 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language
  Models
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
41
228
0
16 Jun 2022
Write and Paint: Generative Vision-Language Models are Unified Modal
  Learners
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Shizhe Diao
Wangchunshu Zhou
Xinsong Zhang
Jiawei Wang
MLLM
AI4CE
22
16
0
15 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
30
124
0
15 Jun 2022
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
  Transformer
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
Jiajun Deng
Zhengyuan Yang
Daqing Liu
Tianlang Chen
Wen-gang Zhou
Yanyong Zhang
Houqiang Li
Wanli Ouyang
ViT
30
50
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
72
528
0
13 Jun 2022
INDIGO: Intrinsic Multimodality for Domain Generalization
INDIGO: Intrinsic Multimodality for Domain Generalization
Puneet Mangla
Shivam Chandhok
Milan Aggarwal
V. Balasubramanian
Balaji Krishnamurthy
VLM
41
2
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD
VLM
24
290
0
12 Jun 2022
A Unified Continuous Learning Framework for Multi-modal Knowledge
  Discovery and Pre-training
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
Zhihao Fan
Zhongyu Wei
Jingjing Chen
Siyuan Wang
Zejun Li
Jiarong Xu
Xuanjing Huang
CLL
14
6
0
11 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
24
111
0
07 Jun 2022
Delving into the Openness of CLIP
Delving into the Openness of CLIP
Shuhuai Ren
Lei Li
Xuancheng Ren
Guangxiang Zhao
Xu Sun
VLM
28
13
0
04 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
12
45
0
02 Jun 2022
Prefix Conditioning Unifies Language and Label Supervision
Prefix Conditioning Unifies Language and Label Supervision
Kuniaki Saito
Kihyuk Sohn
Xinming Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
VLM
CLIP
34
16
0
02 Jun 2022
CLIP4IDC: CLIP for Image Difference Captioning
CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo
T. Wang
Jorma T. Laaksonen
VLM
29
27
0
01 Jun 2022
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal
  Pre-training
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
Yan Zeng
Wangchunshu Zhou
Ao Luo
Ziming Cheng
Xinsong Zhang
VLM
27
30
0
01 Jun 2022
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu
Chengquan Zhang
Shanshan Liu
Meina Qiao
Yangliu Xu
Liang Wu
Kun Yao
Junyu Han
Errui Ding
Jingdong Wang
32
42
0
01 Jun 2022
Previous
123...21222324
Next