ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2107.07651
  4. Cited By
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation
v1v2 (latest)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
    FaML
ArXiv (abs)PDFHTMLGithub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
Title
PØDA: Prompt-driven Zero-shot Domain Adaptation
PØDA: Prompt-driven Zero-shot Domain Adaptation
Mohammad Fahes
Tuan-Hung Vu
Andrei Bursuc
Patrick Pérez
Raoul de Charette
VLM
151
49
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and
  Discriminative Learning
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLMVGen
169
332
0
06 Dec 2022
LUNA: Language Understanding with Number Augmentations on Transformers
  via Number Plugins and Pre-training
LUNA: Language Understanding with Number Augmentations on Transformers via Number Plugins and Pre-training
Hongwei Han
Jialiang Xu
Mengyuan Zhou
Yijia Shao
Shi Han
Dongmei Zhang
LMTD
97
9
0
06 Dec 2022
Controllable Image Captioning via Prompting
Controllable Image Captioning via Prompting
Ning Wang
Jiahao Xie
Jihao Wu
Mingbo Jia
Linlin Li
61
24
0
04 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation
  Learning
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
64
2
0
02 Dec 2022
Normalized Contrastive Learning for Text-Video Retrieval
Normalized Contrastive Learning for Text-Video Retrieval
Yookoon Park
Mahmoud Azab
Bo Xiong
Seungwhan Moon
Florian Metze
Gourab Kundu
Kirmani Ahmed
75
12
0
30 Nov 2022
Improving Commonsense in Vision-Language Models via Knowledge Graph
  Riddles
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Shuquan Ye
Yujia Xie
Dongdong Chen
Yichong Xu
Lu Yuan
Chenguang Zhu
Jing Liao
VLM
66
12
0
29 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Qi Zhang
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjDVLM
62
1
0
28 Nov 2022
Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image
  Models
Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models
Lei Wang
Jian He
Xingdong Xu
Ning Liu
Hui-juan Liu
74
2
0
27 Nov 2022
Exploring Consistency in Cross-Domain Transformer for Domain Adaptive
  Semantic Segmentation
Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation
Kaihong Wang
Donghyun Kim
Regerio Feris
Kate Saenko
Margrit Betke
ViT
64
4
0
27 Nov 2022
CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification
  without Concrete Text Labels
CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels
Siyuan Li
Li Sun
Qingli Li
VLM
151
170
0
25 Nov 2022
Self-supervised vision-language pretraining for Medical visual question
  answering
Self-supervised vision-language pretraining for Medical visual question answering
Pengfei Li
Gang Liu
Lin Tan
Jinying Liao
Shenjun Zhong
MedIm
66
36
0
24 Nov 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic
  Completion Learning
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
78
15
0
24 Nov 2022
Open-vocabulary Attribute Detection
Open-vocabulary Attribute Detection
M. A. Bravo
Sudhanshu Mittal
Simon Ging
Thomas Brox
VLMObjD
92
31
0
23 Nov 2022
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
Siteng Huang
Biao Gong
Yulin Pan
Jianwen Jiang
Yiliang Lv
Yuyuan Li
Donglin Wang
VLMVPVLM
92
42
0
23 Nov 2022
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
X2^22-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Yan Zeng
Xinsong Zhang
Hang Li
Jiawei Wang
Jipeng Zhang
Hkust Wangchunshu Zhou
VLMMLLM
63
15
0
22 Nov 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Yikang Shen
Roei Herzig
...
Donghyun Kim
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLMCoGe
124
72
0
21 Nov 2022
Multitask Vision-Language Prompt Tuning
Multitask Vision-Language Prompt Tuning
Sheng Shen
Shijia Yang
Tianjun Zhang
Bohan Zhai
Joseph E. Gonzalez
Kurt Keutzer
Trevor Darrell
VLMVPVLM
111
53
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffMVLM
100
24
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLMCLIP
70
4
0
21 Nov 2022
Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Qinjie Zheng
Chaoyue Wang
Daqing Liu
Dadong Wang
Dacheng Tao
LRM
56
0
0
21 Nov 2022
Unifying Vision-Language Representation Space with Single-tower
  Transformer
Unifying Vision-Language Representation Space with Single-tower Transformer
Jiho Jang
Chaerin Kong
D. Jeon
Seonhoon Kim
Nojun Kwak
113
21
0
21 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
78
11
0
20 Nov 2022
A survey on knowledge-enhanced multimodal learning
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
159
15
0
19 Nov 2022
Bidirectional Generation of Structure and Properties Through a Single
  Molecular Foundation Model
Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model
Jinho Chang
Jong Chul Ye
AI4CE
63
36
0
19 Nov 2022
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual
  Question Answering
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
Yao Zhang
Haokun Chen
A. Frikha
Yezi Yang
Denis Krompass
Gengyuan Zhang
Jindong Gu
Volker Tresp
VLMLRM
81
7
0
19 Nov 2022
Task Residual for Tuning Vision-Language Models
Task Residual for Tuning Vision-Language Models
Tao Yu
Zhihe Lu
Xin Jin
Zhibo Chen
Xinchao Wang
VLMCLIP
102
92
0
18 Nov 2022
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
James Smith
Paola Cascante-Bonilla
Assaf Arbelle
Donghyun Kim
Yikang Shen
David D. Cox
Diyi Yang
Z. Kira
Rogerio Feris
Leonid Karlinsky
VLM
146
23
0
17 Nov 2022
I Can't Believe There's No Images! Learning Visual Tasks Using only
  Language Supervision
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
Sophia Gu
Christopher Clark
Aniruddha Kembhavi
VLM
68
26
0
17 Nov 2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
113
106
0
15 Nov 2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space
  Alignment
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyan Wang
Yi Zhang
Ming Yan
Ji Zhang
Jitao Sang
VLM
60
9
0
14 Nov 2022
PMR: Prototypical Modal Rebalance for Multimodal Learning
PMR: Prototypical Modal Rebalance for Multimodal Learning
Yunfeng Fan
Wenchao Xu
Yining Qi
Junxiao Wang
Song Guo
74
72
0
14 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for
  Understanding and Generation
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
MLLMVLM
88
8
0
09 Nov 2022
Gradient Knowledge Distillation for Pre-trained Language Models
Gradient Knowledge Distillation for Pre-trained Language Models
Lean Wang
Lei Li
Xu Sun
VLM
69
5
0
02 Nov 2022
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
  Object Detection
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
Yanxin Long
Jianhua Han
Runhu Huang
Xu Hang
Yi Zhu
Chunjing Xu
Xiaodan Liang
VLMObjD
104
19
0
02 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal
Ben Bogin
Jonathan Berant
VLM
48
2
0
01 Nov 2022
Generative Negative Text Replay for Continual Vision-Language
  Pretraining
Generative Negative Text Replay for Continual Vision-Language Pretraining
Shipeng Yan
Lanqing Hong
Hang Xu
Jianhua Han
Tinne Tuytelaars
Zhenguo Li
Xuming He
VLMCLLCLIP
77
18
0
31 Oct 2022
UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal
  Guidance
UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance
Wei Li
Xue Xu
Xinyan Xiao
Jiacheng Liu
Hu Yang
...
Zhanpeng Wang
Zhifan Feng
Qiaoqiao She
Yajuan Lyu
Hua Wu
232
30
0
28 Oct 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
  Retrieval and Captioning
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Suvir Mirchandani
Licheng Yu
Mengjiao MJ Wang
Animesh Sinha
Wen-Jun Jiang
Tao Xiang
Ning Zhang
81
16
0
26 Oct 2022
What's Different between Visual Question Answering for Machine
  "Understanding" Versus for Accessibility?
What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility?
Yang Trista Cao
Kyle Seelman
Kyungjun Lee
Hal Daumé
41
5
0
26 Oct 2022
FairCLIP: Social Bias Elimination based on Attribute Prototype Learning
  and Representation Neutralization
FairCLIP: Social Bias Elimination based on Attribute Prototype Learning and Representation Neutralization
Junyan Wang
Yi Zhang
Jitao Sang
FaMLVLM
89
24
0
26 Oct 2022
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual
  Question Answering
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Q. Si
Yuanxin Liu
Zheng Lin
Peng Fu
Weiping Wang
VLM
117
1
0
26 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak
  Supervision
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
45
6
0
24 Oct 2022
Multilingual Multimodal Learning with Machine Translated Text
Multilingual Multimodal Learning with Machine Translated Text
Chen Qiu
Dan Oneaţă
Emanuele Bugliarello
Stella Frank
Desmond Elliott
121
15
0
24 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
55
6
0
24 Oct 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal
  Modeling
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
98
19
0
21 Oct 2022
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Hong Xuan
Xi Chen
69
2
0
21 Oct 2022
Image-Text Retrieval with Binary and Continuous Label Supervision
Image-Text Retrieval with Binary and Continuous Label Supervision
Zheng Li
Caili Guo
Zerun Feng
Lei Li
Ying Jin
Yufeng Zhang
VLM
71
4
0
20 Oct 2022
CLIP-Driven Fine-grained Text-Image Person Re-identification
CLIP-Driven Fine-grained Text-Image Person Re-identification
Shuanglin Yan
Neng Dong
Liyan Zhang
Jinhui Tang
93
96
0
19 Oct 2022
MMGA: Multimodal Learning with Graph Alignment
MMGA: Multimodal Learning with Graph Alignment
Xuanqi Yang
Quanjin Tao
Xiaojuan Feng
Donghong Cai
Xiang Ren
Yang Yang
34
0
0
18 Oct 2022
Previous
123...202122232425
Next