ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2107.07651
  4. Cited By
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
    FaML
ArXivPDFHTML

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,195 papers shown
Title
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
S. Nagendra
Kashif Rashid
Chaopeng Shen
Daniel Kifer
VLM
76
2
0
16 Dec 2024
Does VLM Classification Benefit from LLM Description Semantics?
Does VLM Classification Benefit from LLM Description Semantics?
Pingchuan Ma
Lennart Rietdorf
Dmytro Kotovenko
Vincent Tao Hu
Bjorn Ommer
VLM
74
1
0
16 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
91
1
0
16 Dec 2024
ViSymRe: Vision-guided Multimodal Symbolic Regression
ViSymRe: Vision-guided Multimodal Symbolic Regression
Da Li
Junping Yin
Jin Xu
Xinxin Li
Juan Zhang
85
1
0
15 Dec 2024
AgentPS: Agentic Process Supervision for Multi-modal Content Quality
  Assurance through Multi-round QA
AgentPS: Agentic Process Supervision for Multi-modal Content Quality Assurance through Multi-round QA
Gorden Liu
Yu Sun
R.-H. Sun
Xin Dong
Hongyu Xiong
LLMAG
85
1
0
15 Dec 2024
Rebalanced Vision-Language Retrieval Considering Structure-Aware
  Distillation
Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
Yang Yang
Wenjuan Xi
Luping Zhou
Jinhui Tang
77
0
0
14 Dec 2024
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models
  for Universal Cross-Domain Retrieval
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Haoyu Jiang
Zhi-Qi Cheng
Gabriel Moreira
Jiawen Zhu
Jingdong Sun
Bukun Ren
Jun-Yan He
Qi Dai
Xian-Sheng Hua
VLM
90
0
0
14 Dec 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
107
7
0
11 Dec 2024
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph
  Generation with Enhanced Spatial Relations
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu
Mengyang Wu
Yuzhi Zhao
Jason Chun Lok Li
Weifeng Ou
LRM
SyDa
VLM
71
2
0
09 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
93
0
0
04 Dec 2024
Exploring Large Vision-Language Models for Robust and Efficient
  Industrial Anomaly Detection
Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection
Kun Qian
Tianyu Sun
Wenhong Wang
71
0
0
01 Dec 2024
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du
Haoxin Li
Jianfei Yu
Boyang Li
152
0
0
01 Dec 2024
VLM-HOI: Vision Language Models for Interpretable Human-Object
  Interaction Analysis
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang
Dasol Jeong
Hyunmin Lee
Sangwoo Park
Hasil Park
Sunkyu Kwon
Yeongjoon Kim
Joonki Paik
MLLM
VLM
79
0
0
27 Nov 2024
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang
Yaxiong Wang
Li Zhu
Zhedong Zheng
98
2
0
26 Nov 2024
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Jungeun Kim
Hyeongwoo Jeon
Jongseong Bae
Ha Young Kim
SLR
85
0
0
25 Nov 2024
ResCLIP: Residual Attention for Training-free Dense Vision-language
  Inference
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
Yuhang Yang
Jinhong Deng
Wen Li
Lixin Duan
VLM
81
0
0
24 Nov 2024
Semantic Shield: Defending Vision-Language Models Against Backdooring
  and Poisoning via Fine-grained Knowledge Alignment
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam
Christopher Thomas
AAML
121
3
0
23 Nov 2024
Cross-Modal Pre-Aligned Method with Global and Local Information for
  Remote-Sensing Image and Text Retrieval
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval
Zengbao Sun
Ming Zhao
Gaorui Liu
Andre Kaup
96
3
0
22 Nov 2024
ORID: Organ-Regional Information Driven Framework for Radiology Report
  Generation
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
Tiancheng Gu
Kaicheng Yang
Xiang An
Ziyong Feng
Dongnan Liu
Weidong Cai
74
1
0
20 Nov 2024
Joint Vision-Language Social Bias Removal for CLIP
Joint Vision-Language Social Bias Removal for CLIP
Haoyu Zhang
Yangyang Guo
Mohan S. Kankanhalli
VLM
72
0
0
19 Nov 2024
SayComply: Grounding Field Robotic Tasks in Operational Compliance through Retrieval-Based Language Models
M. Ginting
Dong-Ki Kim
Sung-Kyun Kim
Bandi Jai Krishna
Mykel J. Kochenderfer
Shayegan Omidshafiei
Ali-akbar Agha-mohammadi
LM&Ro
73
0
0
18 Nov 2024
TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation
Ranmin Wang
Limin Zhuang
Hongkun Chen
Boyan Xu
Ruichu Cai
41
0
0
18 Nov 2024
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
M. Arda Aydın
Efe Mert Çırpar
Elvin Abdinli
Gözde B. Ünal
Y. Sahin
VLM
71
0
0
18 Nov 2024
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language
  Models
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
Jonathan Fhima
Elad Ben Avraham
Oren Nuriel
Yair Kittenplon
Roy Ganz
Aviad Aberdam
Ron Litman
VLM
34
1
0
07 Nov 2024
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability
  Vision-Language Attack
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack
Xiaojun Jia
Sensen Gao
Qing-Wu Guo
Ke Ma
Yihao Huang
Simeng Qin
Yang Liu
Ivor Tsang Fellow
Xiaochun Cao
AAML
46
3
0
04 Nov 2024
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning
  Through Retrieval and Understanding Modalities
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
Ehsan Faghihi
Mohammedreza Zarenejad
Ali-Asghar Beheshti Shirazi
44
0
0
04 Nov 2024
Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification
Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification
Shengxun Wei
Zan Gao
Yibo Zhao
Weili Guan
Weili Guan
Shengyong Chen
46
1
0
01 Nov 2024
Nearest Neighbor Normalization Improves Multimodal Retrieval
Nearest Neighbor Normalization Improves Multimodal Retrieval
Neil Chowdhury
Franklin Wang
Sumedh Shenoy
Douwe Kiela
Sarah Schwettmann
Tristan Thrush
VLM
32
3
0
31 Oct 2024
MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed
  Image Retrieval
MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval
Haiwen Li
Fei Su
Zhicheng Zhao
31
0
0
31 Oct 2024
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang
Maixuan Xue
Xinran Liu
Zheng Pan
Xing Wei
62
1
0
31 Oct 2024
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP
Tianyu Yang
Lisen Dai
Zheyuan Liu
Xiangqi Wang
Meng Jiang
Yapeng Tian
Xiangliang Zhang
VLM
MU
31
4
0
30 Oct 2024
Text-Guided Attention is All You Need for Zero-Shot Robustness in
  Vision-Language Models
Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
Lu Yu
Haiyang Zhang
Changsheng Xu
AAML
VLM
26
3
0
29 Oct 2024
Enhancing CTR Prediction in Recommendation Domain with Search Query
  Representation
Enhancing CTR Prediction in Recommendation Domain with Search Query Representation
Yuening Wang
M. Chen
Yaochen Hu
Wei Guo
Yingxue Zhang
Huifeng Guo
Y. Liu
Mark Coates
20
1
0
28 Oct 2024
Domain Adaptation with a Single Vision-Language Embedding
Domain Adaptation with a Single Vision-Language Embedding
Mohammad Fahes
Tuan-Hung Vu
Andrei Bursuc
Patrick Pérez
Raoul de Charette
VLM
28
0
0
28 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
  Tuning
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
26
0
0
23 Oct 2024
EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning
EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning
Yaxiong Wang
Y. Wang
Lianwei Wu
Lechao Cheng
Zhun Zhong
Meng Wang
VLM
35
0
0
23 Oct 2024
Captions Speak Louder than Images (CASLIE): Generalizing Foundation
  Models for E-commerce from High-quality Multimodal Instruction Data
Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
Xinyi Ling
B. Peng
Hanwen Du
Zhihui Zhu
Xia Ning
31
0
0
22 Oct 2024
IPL: Leveraging Multimodal Large Language Models for Intelligent Product
  Listing
IPL: Leveraging Multimodal Large Language Models for Intelligent Product Listing
Kang Chen
Qingheng Zhang
Chengbao Lian
Yixin Ji
Xuwei Liu
Shuguang Han
Guoqiang Wu
Fei Huang
Jufeng Chen
31
1
0
22 Oct 2024
Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation
Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation
Jiayu Xiong
Jing Wang
Hengjing Xiang
Jun Xue
Chen Xu
Zhouqiang Jiang
32
0
0
20 Oct 2024
BoostAdapter: Improving Vision-Language Test-Time Adaptation via
  Regional Bootstrapping
BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
Taolin Zhang
J. T. Wang
Hang Guo
Tao Dai
Bin Chen
Shu-Tao Xia
VLM
TTA
19
0
0
20 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for
  Vision-Language Pre-Training
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
56
9
0
16 Oct 2024
Mind the Gap Between Prototypes and Images in Cross-domain Finetuning
Mind the Gap Between Prototypes and Images in Cross-domain Finetuning
Hongduan Tian
Feng Liu
Zhanke Zhou
Tongliang Liu
Chengqi Zhang
Bo Han
VLM
37
1
0
16 Oct 2024
A Survey of Low-shot Vision-Language Model Adaptation via Representer
  Theorem
A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem
Kun Ding
Ying Wang
Gaofeng Meng
Shiming Xiang
VLM
31
0
0
15 Oct 2024
Multi-modal Vision Pre-training for Medical Image Analysis
Multi-modal Vision Pre-training for Medical Image Analysis
Shaohao Rui
Lingzhi Chen
Zhenyu Tang
Lilong Wang
M. Liu
S. Zhang
Xiaosong Wang
37
0
0
14 Oct 2024
Leveraging Customer Feedback for Multi-modal Insight Extraction
Leveraging Customer Feedback for Multi-modal Insight Extraction
Sandeep Sricharan Mukku
Abinesh Kanagarajan
Pushpendu Ghosh
Chetan Aggarwal
27
0
0
13 Oct 2024
Skipping Computations in Multimodal LLMs
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
26
2
0
12 Oct 2024
Prompting Video-Language Foundation Models with Domain-specific
  Fine-grained Heuristics for Video Question Answering
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu
Kunhao Fu
Shuhui Wang
Qingming Huang
Jun Yu
46
0
0
12 Oct 2024
M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop
  Chain-of-Thought
M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought
G. Kumari
Kirtan Jain
Asif Ekbal
23
1
0
11 Oct 2024
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Anh-Quan Cao
M. Jaritz
Matthieu Guillaumin
Raoul de Charette
Loris Bazzani
VLM
CLIP
52
2
0
10 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with
  Mask Referring Modeling
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
32
5
0
10 Oct 2024
Previous
123456...222324
Next