ResearchTrend.AI

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
    FaML
ArXiv (abs) | PDF | HTML | GitHub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
Title
BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA
Zhengyang Ji
Shang Gao
Li Liu
Yifan Jia
Yutao Yue
58
0
0
04 Mar 2025
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Ziyang Zhang
Yang Yu
Yucheng Chen
Xulei Yang
S. Yeo
MedIm
179
2
0
02 Mar 2025
IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis
Yun Wang
Jingchen Ni
Yong-Jin Liu
Chun Yuan
Yansong Tang
96
4
0
02 Mar 2025
Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series
Yanan Niu
Roy Sarkis
D. Psaltis
Mario Paolone
Christophe Moser
Luisa Lambertini
131
0
0
28 Feb 2025
MICINet: Multi-Level Inter-Class Confusing Information Removal for Reliable Multimodal Classification
Tianze Zhang
Shu Shen
Chao Chen
116
0
0
27 Feb 2025
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Aayush Dhakal
Srikumar Sastry
Subash Khanal
Adeel Ahmad
Eric Xing
Nathan Jacobs
152
0
0
27 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Chenyang Zhao
Kun Wang
J. H. Hsiao
Antoni B. Chan
CLIP
108
0
0
26 Feb 2025
Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models
Jiawei Kong
Hao Fang
Sihang Guo
Chenxi Qing
Bin Chen
Bin Wang
Shu-Tao Xia
AAML, VLM
132
0
0
26 Feb 2025
Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation
Wenxuan Wang
K. Wu
Yujian Betterest Li
Dan Wang
Xinsong Zhang
Qingbin Liu
AI4TS
115
1
0
24 Feb 2025
CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification
Liping Lu
Zihao Fu
Duanfeng Chu
Wei Wang
Bingrong Xu
VLM
102
0
0
24 Feb 2025
Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
Wenzhe Yin
Zehao Xiao
Pan Zhou
Shujian Yu
Jiayi Shen
Jan-Jakob Sonke
E. Gavves
177
1
0
24 Feb 2025
Graph Perceiver IO: A General Architecture for Graph Structured Data
Seyun Bae
Hoyoon Byun
Changdae Oh
Yoon-Sik Cho
Kyungwoo Song
GNN
256
3
0
24 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
110
16
0
24 Feb 2025
Understanding the Emergence of Multimodal Representation Alignment
Megan Tjandrasuwita
Chanakya Ekbote
Liu Ziyin
Paul Pu Liang
108
2
0
22 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
529
0
0
21 Feb 2025
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
Yuheng Ji
Yue Liu
Zhicheng Zhang
Zhao Zhang
Yuting Zhao
Gang Zhou
Xingwei Zhang
Xinwang Liu
Xiaolong Zheng
VLM
184
4
0
21 Feb 2025
Learning Generalizable Prompt for CLIP with Class Similarity Knowledge
Sehun Jung
Hyang-won Lee
VLM, VPVLM
73
0
0
17 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
97
0
0
09 Feb 2025
Boosting Weak Positives for Text Based Person Search
Akshay Modi
Ashhar Aziz
Nilanjana Chatterjee
A V Subramanyam
142
0
0
29 Jan 2025
sDREAMER: Self-distilled Mixture-of-Modality-Experts Transformer for Automatic Sleep Staging
Jingyuan Chen
Yuan Yao
Mie Anderson
Natalie Hauglund
Celia Kjaerby
Verena Untiet
Maiken Nedergaard
Jiebo Luo
153
2
0
28 Jan 2025
Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap
Srivatsa Mallapragada
Ying Xie
Varsha Rani Chawan
Zeyad Hailat
Yuanbo Wang
108
0
0
28 Jan 2025
BiFold: Bimanual Cloth Folding with Language Guidance
Oriol Barbany
Adrià Colomé
Carme Torras
42
1
0
27 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
83
0
0
20 Jan 2025
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
J. Park
Jungbeom Lee
Jongyoon Song
Sangwon Yu
Dahuin Jung
Sungroh Yoon
122
3
0
19 Jan 2025
LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
Pengcheng Zhao
Zhixian He
Fuwei Zhang
Shujin Lin
Fan Zhou
136
2
0
18 Jan 2025
A Resource-Efficient Training Framework for Remote Sensing Text–Image Retrieval
Weihang Zhang
Jihao Li
Shuoke Li
Ziqing Niu
Jialiang Chen
Wenkai Zhang
VLM
78
0
0
18 Jan 2025
Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data
Omar Mena
Alexandre Kouyoumdjian
Lonni Besancon
Michael Gleicher
I. Viola
Anders Ynnerman
93
0
0
17 Jan 2025
MULTI: Multimodal Understanding Leaderboard with Text and Images
Zichen Zhu
Yang Xu
Lu Chen
Jingkai Yang
Yichuan Ma
...
Yingzi Ma
Situo Zhang
Zihan Zhao
Liangtai Sun
Kai Yu
VLM
116
5
0
08 Jan 2025
Foundations of GenIR
Qingyao Ai
Jingtao Zhan
Yang Liu
126
0
0
06 Jan 2025
GeAR: Generation Augmented Retrieval
Haoyu Liu
Shaohan Huang
Jianfeng Liu
Yuefeng Zhan
H. Sun
Weiwei Deng
Feng Sun
Furu Wei
Qi Zhang
84
1
0
06 Jan 2025
Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
Hebin Wang
Yangning Li
Hai-Tao Zheng
Hai-Tao Zheng
Wenhao Jiang
Hong-Gee Kim
143
0
0
03 Jan 2025
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw, LM&MA, LRM
145
30
0
31 Dec 2024
Enhancing Visual Representation for Text-based Person Searching
Wei Shen
Ming Fang
Yuxia Wang
Jiafeng Xiao
Diping Li
Ningyu Zhang
Ling Xu
Weinan Zhang
111
4
0
31 Dec 2024
M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios
Ning Liao
Xiaopeng Zhang
Minglu Cao
Junchi Yan
VPVLM, VLM
183
0
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
282
5
0
31 Dec 2024
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Yi Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
141
2
0
25 Dec 2024
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding
Zhuo Cao
Bingqing Zhang
Heming Du
Xin Yu
Xue Li
Sen Wang
125
2
0
18 Dec 2024
Bringing Multimodality to Amazon Visual Search System
Xinliang Zhu
Michael Huang
Han Ding
Jinyu Yang
Kelvin Chen
...
Son Dinh Tran
Benjamin Z. Yao
Doug Gray
Anuj Bindal
Arnab Dhua
112
3
0
17 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
143
11
0
17 Dec 2024
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang
Tang Li
Kien X. Nguyen
Xi Peng
182
0
0
17 Dec 2024
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
S. Nagendra
Kashif Rashid
Chaopeng Shen
Daniel Kifer
VLM
143
2
0
16 Dec 2024
Does VLM Classification Benefit from LLM Description Semantics?
Pingchuan Ma
Lennart Rietdorf
Dmytro Kotovenko
Vincent Tao Hu
Bjorn Ommer
VLM
148
1
0
16 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
204
4
0
16 Dec 2024
ViSymRe: Vision-guided Multimodal Symbolic Regression
Da Li
Junping Yin
Jin Xu
Xinxin Li
Juan Zhang
130
1
0
15 Dec 2024
AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs
Gorden Liu
Yu Sun
R.-H. Sun
Xin Dong
Hongyu Xiong
LLMAG
125
1
0
15 Dec 2024
Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
Yang Yang
Wenjuan Xi
Luping Zhou
Jinhui Tang
148
0
0
14 Dec 2024
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Haoyu Jiang
Zhi-Qi Cheng
Gabriel Moreira
Jiawen Zhu
Jingdong Sun
Bukun Ren
Jun-Yan He
Qi Dai
Xian-Sheng Hua
VLM
142
0
0
14 Dec 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
353
10
0
11 Dec 2024
Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning
Can Yaras
Siyi Chen
Peng Wang
Q. Qu
VLM
89
3
0
10 Dec 2024
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu
Mengyang Wu
Yuzhi Zhao
Jason Chun Lok Li
Weifeng Ou
LRM, SyDa, VLM
129
4
0
09 Dec 2024