arXiv: 2107.07651
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven Hoi
    FaML
ArXiv (abs) | PDF | HTML | GitHub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM, VLM
16
0
0
20 Jun 2025
Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments
Yasir Ali Farrukh
S. Wali
I. Khan
Nathaniel D. Bastian
VLM
18
0
0
20 Jun 2025
Learning Event Completeness for Weakly Supervised Video Anomaly Detection
Yu Wang
Shiwei Chen
27
0
0
16 Jun 2025
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Bo-Cheng Chiu
Jen-Jee Chen
Yu-Chee Tseng
Feng-Chi Chen
14
0
0
13 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE, VLM
30
0
0
13 Jun 2025
Dynamic Double Space Tower
Weikai Sun
Shijie Song
Han Wang
15
0
0
13 Jun 2025
Can Sound Replace Vision in LLaVA With Token Substitution?
Ali Vosoughi
Jing Bi
Pinxin Liu
Yunlong Tang
Chenliang Xu
CLIP, VLM
119
0
0
12 Jun 2025
ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model
Jialong Zuo
Yongtai Deng
Mengdan Tan
Rui Jin
Dongyue Wu
Nong Sang
Liang Pan
Changxin Gao
51
0
0
11 Jun 2025
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
Seonho Lee
Jiho Choi
Inha Kang
Jiwook Kim
J. Park
Hyunjung Shim
VLM
60
0
0
11 Jun 2025
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
Leqi Shen
Guoqiang Gong
Tianxiang Hao
Tao He
Yifeng Zhang
Pengzhang Liu
Sicheng Zhao
Jungong Han
Guiguang Ding
24
0
0
10 Jun 2025
Spatial Transcriptomics Expression Prediction from Histopathology Based on Cross-Modal Mask Reconstruction and Contrastive Learning
Junzhuo Liu
Markus Eckstein
Zhixiang Wang
Friedrich Feuerhake
Dorit Merhof
MedIm
18
0
0
10 Jun 2025
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models
Chenyu Lian
Hong-Yu Zhou
Dongyun Liang
J. Qin
L. Wang
MedIm, VLM
33
0
0
10 Jun 2025
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
Boyu Chen
Siran Chen
Kunchang Li
Qinglin Xu
Yu Qiao
Yali Wang
VOS
25
0
0
09 Jun 2025
Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
Pengfei Zhao
Rongbo Luan
Wei Zhang
Peng Wu
Sifeng He
25
0
0
08 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits
Divya J. Bajpai
M. Hanawal
VLM
15
0
0
07 Jun 2025
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques
Adarsh Prasad Behera
J. Champati
Roberto Morabito
Sasu Tarkoma
J. Gross
23
0
0
06 Jun 2025
Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
Yiheng Li
Yang Yang
Zichang Tan
Huan Liu
Weihua Chen
Xu Zhou
Zhen Lei
69
0
0
06 Jun 2025
Robust Few-Shot Vision-Language Model Adaptation
Hanxin Wang
Tian Liu
Shu Kong
VLM
116
0
0
05 Jun 2025
Locality Preserving Markovian Transition for Instance Retrieval
Jifei Luo
Wenzheng Wu
Hantao Yao
Lu Yu
Changsheng Xu
85
0
0
05 Jun 2025
Aligning Multimodal Representations through an Information Bottleneck
Antonio Almudévar
José Miguel Hernández-Lobato
Sameer Khurana
R. Marxer
Alfonso Ortega
SSL
112
0
0
05 Jun 2025
Attacking Attention of Foundation Models Disrupts Downstream Tasks
Hondamunige Prasanna Silva
Federico Becattini
Lorenzo Seidenari
AAML
20
0
0
03 Jun 2025
Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
Shuyu Yang
Yilun Wang
Yaxiong Wang
Li Zhu
Zhedong Zheng
VGen
64
0
0
02 Jun 2025
iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection
Huahui Yi
Wei Xu
Ziyuan Qin
Xi Chen
Xiaohu Wu
Kang Li
Qicheng Lao
VLM
29
0
0
31 May 2025
M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction
Cunhang Fan
Ying Chen
Jian Zhou
Zexu Pan
Jingjing Zhang
Youdian Gao
Xiaoke Yang
Zhengqi Wen
Zhao Lv
32
0
0
31 May 2025
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
Ying Yang
Jie Zhang
Xiao Lv
Di Lin
Tao Xiang
Qing Guo
AAML, VLM
38
0
0
30 May 2025
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding
Mingyang Mao
Mariela M. Perez-Cabarcas
Utteja Kallakuri
Nicholas R. Waytowich
Xiaomin Lin
T. Mohsenin
28
1
0
29 May 2025
Improving Contrastive Learning for Referring Expression Counting
Kostas Triaridis
Panagiotis Kaliosis
E-Ro Nguyen
Jingyi Xu
Hieu M. Le
Dimitris Samaras
SSL
65
0
0
28 May 2025
MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Xu Li
Fan Lyu
LRM
20
0
0
26 May 2025
MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval
Rong-Cheng Tu
Zhao Jin
Jingyi Liao
Xiao Luo
Yingjie Wang
Li Shen
Dacheng Tao
105
0
0
26 May 2025
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
Daniel Csizmadia
Andrei Codreanu
Victor Sim
Vighnesh Prabhu
Michael Lu
Kevin Zhu
Sean O'Brien
Vasu Sharma
CLIP, VLM
71
0
0
25 May 2025
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
G. Meng
Sunan He
Jinpeng Wang
Tao Dai
Letian Zhang
Jieming Zhu
Qing Li
Gang Wang
Rui Zhang
Yong Jiang
VLM
296
0
0
24 May 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Taewon Kang
Ming C. Lin
DiffM, VGen
83
0
0
22 May 2025
Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text
Kun-Yu Lin
Hongjun Wang
Weining Ren
Kai Han
291
0
0
22 May 2025
NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation
Weiming Wu
Zi-kang Wang
Jin Ye
Zhi Zhou
Yu-Feng Li
Lan-Zhe Guo
LRM
65
0
0
21 May 2025
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
Nikolaos Chaidos
Angeliki Dimitriou
Maria Lymperaiou
Giorgos Stamou
67
0
0
21 May 2025
RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Subrata Biswas
Mohammad Nur Hossain Khan
Bashima Islam
104
0
0
21 May 2025
Beyond Text: Unveiling Privacy Vulnerabilities in Multi-modal Retrieval-Augmented Generation
Jiankun Zhang
Shenglai Zeng
Jie Ren
Tianqi Zheng
Hui Liu
Xianfeng Tang
Hui Liu
Yi Chang
56
0
0
20 May 2025
GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval
Chengsong Sun
Weiping Li
Xiang Li
Yunxing Liu
Lianlei Shan
VLM
93
0
0
19 May 2025
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Subash Khanal
Srikumar Sastry
Aayush Dhakal
Adeel Ahmad
Nathan Jacobs
76
0
0
19 May 2025
Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding
Xuefei Sun
Doncey Albin
Cecilia Mauceri
Dusty Woods
Christoffer Heckman
LRM
46
0
0
18 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
74
0
0
16 May 2025
Geofenced Unmanned Aerial Robotic Defender for Deer Detection and Deterrence (GUARD)
Ebasa Temesgen
Mario Jerez
Greta Brown
Graham Wilson
Sree Ganesh Lalitaditya Divakarla
Sarah Boelter
Oscar Nelson
Robert McPherson
Maria Gini
73
0
0
16 May 2025
Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare
Amara Tariq
Rimita Lahiri
Charles Kahn
Imon Banerjee
64
0
0
12 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
155
1
0
08 May 2025
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding
Feng Xiao
Hongbin Xu
Guocan Zhao
Wenxiong Kang
247
0
0
07 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Yixiao Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
440
2
0
07 May 2025
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
187
0
0
05 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP, CoGe, VLM
91
0
0
04 May 2025
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin
Yuchen Wang
X. Bai
Xiaochen Li
Weili Guan
Liqiang Nie
Xinyang Chen
VLM
113
0
0
04 May 2025
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts
Wenfa Wu
Guanyu Zhang
Zheng Tan
Yi Wang
Hongsheng Qi
AI4TS
108
2
0
02 May 2025