2107.07651
Cited By
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
FaML
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,195 papers shown
Title
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
S. Nagendra
Kashif Rashid
Chaopeng Shen
Daniel Kifer
VLM
76
2
0
16 Dec 2024
Does VLM Classification Benefit from LLM Description Semantics?
Pingchuan Ma
Lennart Rietdorf
Dmytro Kotovenko
Vincent Tao Hu
Bjorn Ommer
VLM
74
1
0
16 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
91
1
0
16 Dec 2024
ViSymRe: Vision-guided Multimodal Symbolic Regression
Da Li
Junping Yin
Jin Xu
Xinxin Li
Juan Zhang
85
1
0
15 Dec 2024
AgentPS: Agentic Process Supervision for Multi-modal Content Quality Assurance through Multi-round QA
Gorden Liu
Yu Sun
R.-H. Sun
Xin Dong
Hongyu Xiong
LLMAG
85
1
0
15 Dec 2024
Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
Yang Yang
Wenjuan Xi
Luping Zhou
Jinhui Tang
77
0
0
14 Dec 2024
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Haoyu Jiang
Zhi-Qi Cheng
Gabriel Moreira
Jiawen Zhu
Jingdong Sun
Bukun Ren
Jun-Yan He
Qi Dai
Xian-Sheng Hua
VLM
90
0
0
14 Dec 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
107
7
0
11 Dec 2024
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu
Mengyang Wu
Yuzhi Zhao
Jason Chun Lok Li
Weifeng Ou
LRM
SyDa
VLM
71
2
0
09 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
93
0
0
04 Dec 2024
Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection
Kun Qian
Tianyu Sun
Wenhong Wang
71
0
0
01 Dec 2024
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du
Haoxin Li
Jianfei Yu
Boyang Li
152
0
0
01 Dec 2024
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang
Dasol Jeong
Hyunmin Lee
Sangwoo Park
Hasil Park
Sunkyu Kwon
Yeongjoon Kim
Joonki Paik
MLLM
VLM
79
0
0
27 Nov 2024
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang
Yaxiong Wang
Li Zhu
Zhedong Zheng
98
2
0
26 Nov 2024
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Jungeun Kim
Hyeongwoo Jeon
Jongseong Bae
Ha Young Kim
SLR
85
0
0
25 Nov 2024
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
Yuhang Yang
Jinhong Deng
Wen Li
Lixin Duan
VLM
81
0
0
24 Nov 2024
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam
Christopher Thomas
AAML
121
3
0
23 Nov 2024
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval
Zengbao Sun
Ming Zhao
Gaorui Liu
Andre Kaup
96
3
0
22 Nov 2024
ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
Tiancheng Gu
Kaicheng Yang
Xiang An
Ziyong Feng
Dongnan Liu
Weidong Cai
74
1
0
20 Nov 2024
Joint Vision-Language Social Bias Removal for CLIP
Haoyu Zhang
Yangyang Guo
Mohan S. Kankanhalli
VLM
72
0
0
19 Nov 2024
SayComply: Grounding Field Robotic Tasks in Operational Compliance through Retrieval-Based Language Models
M. Ginting
Dong-Ki Kim
Sung-Kyun Kim
Bandi Jai Krishna
Mykel J. Kochenderfer
Shayegan Omidshafiei
Ali-akbar Agha-mohammadi
LM&Ro
73
0
0
18 Nov 2024
TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation
Ranmin Wang
Limin Zhuang
Hongkun Chen
Boyan Xu
Ruichu Cai
41
0
0
18 Nov 2024
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
M. Arda Aydın
Efe Mert Çırpar
Elvin Abdinli
Gözde B. Ünal
Y. Sahin
VLM
71
0
0
18 Nov 2024
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
Jonathan Fhima
Elad Ben Avraham
Oren Nuriel
Yair Kittenplon
Roy Ganz
Aviad Aberdam
Ron Litman
VLM
34
1
0
07 Nov 2024
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack
Xiaojun Jia
Sensen Gao
Qing-Wu Guo
Ke Ma
Yihao Huang
Simeng Qin
Yang Liu
Ivor Tsang
Xiaochun Cao
AAML
46
3
0
04 Nov 2024
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
Ehsan Faghihi
Mohammedreza Zarenejad
Ali-Asghar Beheshti Shirazi
44
0
0
04 Nov 2024
Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification
Shengxun Wei
Zan Gao
Yibo Zhao
Weili Guan
Shengyong Chen
46
1
0
01 Nov 2024
Nearest Neighbor Normalization Improves Multimodal Retrieval
Neil Chowdhury
Franklin Wang
Sumedh Shenoy
Douwe Kiela
Sarah Schwettmann
Tristan Thrush
VLM
32
3
0
31 Oct 2024
MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval
Haiwen Li
Fei Su
Zhicheng Zhao
31
0
0
31 Oct 2024
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang
Maixuan Xue
Xinran Liu
Zheng Pan
Xing Wei
62
1
0
31 Oct 2024
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP
Tianyu Yang
Lisen Dai
Zheyuan Liu
Xiangqi Wang
Meng Jiang
Yapeng Tian
Xiangliang Zhang
VLM
MU
31
4
0
30 Oct 2024
Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
Lu Yu
Haiyang Zhang
Changsheng Xu
AAML
VLM
26
3
0
29 Oct 2024
Enhancing CTR Prediction in Recommendation Domain with Search Query Representation
Yuening Wang
M. Chen
Yaochen Hu
Wei Guo
Yingxue Zhang
Huifeng Guo
Y. Liu
Mark Coates
20
1
0
28 Oct 2024
Domain Adaptation with a Single Vision-Language Embedding
Mohammad Fahes
Tuan-Hung Vu
Andrei Bursuc
Patrick Pérez
Raoul de Charette
VLM
28
0
0
28 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
26
0
0
23 Oct 2024
EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning
Yaxiong Wang
Y. Wang
Lianwei Wu
Lechao Cheng
Zhun Zhong
Meng Wang
VLM
35
0
0
23 Oct 2024
Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
Xinyi Ling
B. Peng
Hanwen Du
Zhihui Zhu
Xia Ning
31
0
0
22 Oct 2024
IPL: Leveraging Multimodal Large Language Models for Intelligent Product Listing
Kang Chen
Qingheng Zhang
Chengbao Lian
Yixin Ji
Xuwei Liu
Shuguang Han
Guoqiang Wu
Fei Huang
Jufeng Chen
31
1
0
22 Oct 2024
Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation
Jiayu Xiong
Jing Wang
Hengjing Xiang
Jun Xue
Chen Xu
Zhouqiang Jiang
32
0
0
20 Oct 2024
BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
Taolin Zhang
J. T. Wang
Hang Guo
Tao Dai
Bin Chen
Shu-Tao Xia
VLM
TTA
19
0
0
20 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
56
9
0
16 Oct 2024
Mind the Gap Between Prototypes and Images in Cross-domain Finetuning
Hongduan Tian
Feng Liu
Zhanke Zhou
Tongliang Liu
Chengqi Zhang
Bo Han
VLM
37
1
0
16 Oct 2024
A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem
Kun Ding
Ying Wang
Gaofeng Meng
Shiming Xiang
VLM
31
0
0
15 Oct 2024
Multi-modal Vision Pre-training for Medical Image Analysis
Shaohao Rui
Lingzhi Chen
Zhenyu Tang
Lilong Wang
M. Liu
S. Zhang
Xiaosong Wang
37
0
0
14 Oct 2024
Leveraging Customer Feedback for Multi-modal Insight Extraction
Sandeep Sricharan Mukku
Abinesh Kanagarajan
Pushpendu Ghosh
Chetan Aggarwal
27
0
0
13 Oct 2024
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
26
2
0
12 Oct 2024
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu
Kunhao Fu
Shuhui Wang
Qingming Huang
Jun Yu
46
0
0
12 Oct 2024
M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought
G. Kumari
Kirtan Jain
Asif Ekbal
23
1
0
11 Oct 2024
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Anh-Quan Cao
M. Jaritz
Matthieu Guillaumin
Raoul de Charette
Loris Bazzani
VLM
CLIP
52
2
0
10 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
32
5
0
10 Oct 2024