ResearchTrend.AI
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven Hoi
    FaML

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation
Kang Liu
Zhuoqi Ma
Mengmeng Liu
Zhicheng Jiao
Xiaolu Kang
Qiguang Miao
Kun Xie
MedIm
83
1
0
15 May 2024
Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis
A. Englebert
Anne-Sophie Collin
O. Cornu
Christophe De Vleeschouwer
74
1
0
14 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
93
10
0
14 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
66
0
0
09 May 2024
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
Shibo Jie
Yehui Tang
Ning Ding
Zhi-Hong Deng
Kai Han
Yunhe Wang
VLM
111
11
0
09 May 2024
Using Machine Translation to Augment Multilingual Classification
Adam King
83
0
0
09 May 2024
All in One Framework for Multimodal Re-identification in the Wild
He Li
Mang Ye
Ming Zhang
Bo Du
83
11
0
08 May 2024
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
Zheng Zhu
Xiaofeng Wang
Wangbo Zhao
Chen Min
Nianchen Deng
...
Dawei Zhao
Liang Xiao
Jian-jun Zhao
Jiwen Lu
Guan Huang
VGen LM&Ro
174
48
0
06 May 2024
Knowledge-aware Text-Image Retrieval for Remote Sensing Images
Li Mi
Xianjie Dai
J. Castillo-Navarro
D. Tuia
62
5
0
06 May 2024
Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
Jiacheng Cheng
Hijung Valentina Shin
Nuno Vasconcelos
Bryan C. Russell
Fabian Caba Heilbron
VLM
60
1
0
06 May 2024
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
CLIP VLM
175
23
0
30 Apr 2024
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi
Jana Kosecka
67
1
0
29 Apr 2024
TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation
Junhao Cheng
Baiqiao Yin
Kaixin Cai
Minbin Huang
Hanhui Li
...
Yue Li
Yifei Li
Yuhao Cheng
Yiqiang Yan
Xiaodan Liang
DiffM MLLM
138
13
0
29 Apr 2024
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment
Tengjun Huang
117
0
0
28 Apr 2024
Multimodal Fusion on Low-quality Data: A Comprehensive Survey
Qingyang Zhang
Yake Wei
Zongbo Han
Huazhu Fu
Xi Peng
...
Qinghua Hu
Cai Xu
Jie Wen
Di Hu
Changqing Zhang
121
31
0
27 Apr 2024
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
103
0
0
27 Apr 2024
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
Xuri Ge
Songpei Xu
Fuhai Chen
Jie Wang
Guoxin Wang
Shan An
Joemon M. Jose
3DPC
110
12
0
26 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Philip Torr
Wei Liu
Zhifeng Li
127
13
0
25 Apr 2024
Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models
Jiawei Chen
Dingkang Yang
Yue Jiang
Mingcheng Li
Jinjie Wei
Xiaolu Hou
Lihua Zhang
111
6
0
25 Apr 2024
VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
VLM CoGe
106
0
0
25 Apr 2024
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Eric Slyman
Stefan Lee
Scott D. Cohen
Kushal Kafle
VLM
59
5
0
24 Apr 2024
Leveraging Large Language Models for Multimodal Search
Oriol Barbany
Michael Huang
Xinliang Zhu
Arnab Dhua
92
10
0
24 Apr 2024
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
Young Kyun Jang
Donghyun Kim
Zihang Meng
Dat Huynh
Ser-Nam Lim
79
12
0
23 Apr 2024
Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation
Yikun Zhang
Geyan Ye
Chaohao Yuan
Bo Han
Long-Kai Huang
Jianhua Yao
Wei Liu
Yu Rong
151
3
0
23 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
117
2
0
22 Apr 2024
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
Shouwei Ruan
Yinpeng Dong
Hanqing Liu
Yao Huang
Hang Su
Xingxing Wei
VLM
103
1
0
18 Apr 2024
Functional Protein Design with Local Domain Alignment
Chaohao Yuan
Songyou Li
Geyan Ye
Yikun Zhang
Long-Kai Huang
Wenbing Huang
Wei Liu
Jianhua Yao
Yu Rong
91
1
0
18 Apr 2024
Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation
Qiyuan Dai
Sibei Yang
86
9
0
18 Apr 2024
The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models
Cheng Shi
Sibei Yang
VLM
96
4
0
18 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
153
7
0
18 Apr 2024
Vocabulary-free Image Classification and Semantic Segmentation
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
87
3
0
16 Apr 2024
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Zaid Khan
Yun Fu
AAML
83
10
0
16 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
122
8
0
16 Apr 2024
Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels
Amaya Dharmasiri
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
VLM 3DPC
63
1
0
15 Apr 2024
Bridging Vision and Language Spaces with Assignment Prediction
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
VLM
97
7
0
15 Apr 2024
Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder
Chong Peng
Liqiang He
Dan Su
CVBM
107
0
0
15 Apr 2024
Probing the 3D Awareness of Visual Foundation Models
Mohamed El Banani
Amit Raj
Kevis-Kokitsi Maninis
Abhishek Kar
Yuanzhen Li
Michael Rubinstein
Deqing Sun
Leonidas Guibas
Justin Johnson
Varun Jampani
101
86
0
12 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
112
5
0
11 Apr 2024
How is Visual Attention Influenced by Text Guidance? Database and Model
Yinan Sun
Xiongkuo Min
Huiyu Duan
Guangtao Zhai
167
4
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM VLM
80
32
0
10 Apr 2024
Unified Language-driven Zero-shot Domain Adaptation
Senqiao Yang
Zhuotao Tian
Li Jiang
Jiaya Jia
95
10
0
10 Apr 2024
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina
Massimiliano Mancini
Elia Cunegatti
Gaowen Liu
Giovanni Iacca
Elisa Ricci
VLM
79
2
0
08 Apr 2024
Progressive Alignment with VLM-LLM Feature to Augment Defect Classification for the ASE Dataset
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
37
0
0
08 Apr 2024
Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models
Weiwei Cao
Jianpeng Zhang
Yingda Xia
Tony C. W. Mok
Zi Li
X. Ye
Le Lu
Jian Zheng
Yuxing Tang
Ling Zhang
58
4
0
07 Apr 2024
Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving
Jinlong Li
Baolu Li
Zhengzhong Tu
Xinyu Liu
Qing Guo
Felix Juefei Xu
Runsheng Xu
Hongkai Yu
DiffM
123
26
0
07 Apr 2024
To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
Zi-Hao Qiu
Siqi Guo
Mao Xu
Tuo Zhao
Lijun Zhang
Tianbao Yang
AI4TS AI4CE
117
4
0
06 Apr 2024
Label Propagation for Zero-shot Classification with Vision-Language Models
Vladan Stojnić
Yannis Kalantidis
Giorgos Tolias
VLM
71
9
0
05 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
132
28
0
04 Apr 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Vishaal Udandarao
Ameya Prabhu
Adhiraj Ghosh
Yash Sharma
Philip Torr
Adel Bibi
Samuel Albanie
Matthias Bethge
VLM
220
55
0
04 Apr 2024
DeViDe: Faceted medical knowledge for improved medical vision-language pre-training
Haozhe Luo
Ziyu Zhou
Corentin Royer
Anjany Sekuboyina
Bjoern Menze
VLM ViT MedIm
101
7
0
04 Apr 2024