Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
    FaML
arXiv:2107.07651 (abs) | PDF | HTML | GitHub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
ManiNeg: Manifestation-guided Multimodal Pretraining for Mammography Classification
Xujun Li
Xin Wei
Jing Jiang
Danxiang Chen
Wei Zhang
Jinpeng Li
106
0
0
24 Sep 2024
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features
Xin Wei
Yaling Tao
Changde Du
Gangming Zhao
Yizhou Yu
Jinpeng Li
95
0
0
24 Sep 2024
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Hong Chen
Xin Wang
Yuwei Zhou
Bin Huang
Yipeng Zhang
Wei Feng
Houlun Chen
Zeyang Zhang
Siao Tang
Wenwu Zhu
DiffM
136
9
0
23 Sep 2024
Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models
Anil Osman Tur
Alessandro Conti
Cigdem Beyan
Davide Boscaini
Roberto Larcher
S. Messelodi
Fabio Poiesi
Elisa Ricci
VLM
108
0
0
23 Sep 2024
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery
Jicheol Park
Dongwon Kim
Boseung Jeong
Suha Kwak
100
4
0
20 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe, VLM
171
2
0
19 Sep 2024
Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation
Qilong Zhangli
Di Liu
Abhishek Aich
Dimitris Metaxas
S. Schulter
73
0
0
15 Sep 2024
FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots
Bo Peng
D. Baek
Qijie Wang
Joao Ramos
73
0
0
15 Sep 2024
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training
Yiyi Tao
Zhuoyue Wang
Hang Zhang
Lun Wang
VLM
99
16
0
15 Sep 2024
Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM
Yuanjie Lyu
Tong Xu
Zihan Niu
Bo Peng
Jing Ke
Enhong Chen
64
0
0
14 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe, VLM
56
0
0
12 Sep 2024
Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization
Ling Xing
Hongyu Qu
Rui Yan
Xiangbo Shu
Jinhui Tang
161
2
0
12 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
105
6
0
11 Sep 2024
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding
Ji Ha Jang
H. Seo
Se Young Chun
93
3
0
10 Sep 2024
Revisiting Prompt Pretraining of Vision-Language Models
Zhenyuan Chen
Lingfeng Yang
Shuo Chen
Zhaowei Chen
Jiajun Liang
Xiang Li
MLLM, VPVLM, VLM
119
2
0
10 Sep 2024
Mamba-Enhanced Text-Audio-Video Alignment Network for Emotion Recognition in Conversations
Xinran Li
Xiaomao Fan
Q. Wu
Xiaojiang Peng
Yongqian Li
Mamba
81
1
0
08 Sep 2024
CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
Ivana Beňová
Michal Gregor
Albert Gatt
74
1
0
02 Sep 2024
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen
Tianxiang Hao
Tao He
Sicheng Zhao
Pengzhang Liu
Yongjun Bao
Guiguang Ding
264
15
0
02 Sep 2024
How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?
Sicheng Wang
Che Liu
Rossella Arcucci
VLM, MedIm
138
0
0
31 Aug 2024
Medical Report Generation Is A Multi-label Classification Problem
Yijian Fan
Zhenbang Yang
Rui Liu
Mingjie Li
Xiaojun Chang
MedIm
131
1
0
30 Aug 2024
Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning
Zhengqing Gao
Xiang Ao
Xu-Yao Zhang
Cheng-Lin Liu
VLM, VPVLM
89
0
0
29 Aug 2024
Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach
Mian Zou
Baosheng Yu
Yibing Zhan
Siwei Lyu
Kede Ma
CVBM
136
2
0
29 Aug 2024
NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework
Shuangchen Zhao
Changde Du
Hui Li
Huiguang He
72
0
0
27 Aug 2024
Evaluating Attribute Comprehension in Large Vision-Language Models
Haiwen Zhang
Zixi Yang
Yuanzhi Liu
Xinran Wang
Zheqi He
Kongming Liang
Zhanyu Ma
ELM
53
0
0
25 Aug 2024
Online Zero-Shot Classification with CLIP
Qi Qian
Juhua Hu
VLM
75
7
0
23 Aug 2024
TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model
Yuhao Wang
Chao Hao
Yawen Cui
Xinqi Su
Weicheng Xie
Tao Tan
Zitong Yu
LM&MA, MedIm
70
0
0
22 Aug 2024
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering
Yuliang Cai
Mohammad Rostami
CLL, VLM, MLLM
128
4
0
21 Aug 2024
PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection
Tri Cao
Chengyu Huang
Yuexin Li
Huilin Wang
Amy He
Nay Oo
Bryan Hooi
LLMAG, OffRL
166
7
0
20 Aug 2024
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification
Yonggan Wu
Ling-Chao Meng
Yuan Zichao
Sixian Chan
Hong-Qiang Wang
103
4
0
20 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
61
0
0
18 Aug 2024
CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination
Kaicheng Yang
Tiancheng Gu
Xiang An
Haiqiang Jiang
Xiangzi Dai
Ziyong Feng
Weidong Cai
Jiankang Deng
VLM
99
8
0
18 Aug 2024
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
Qifei Li
Yingming Gao
Yuhua Wen
Cong Wang
Ya Li
61
1
0
18 Aug 2024
Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval
Lifeng Zhou
Yuke Li
Rui Deng
Yuting Yang
Haoqi Zhu
67
0
0
15 Aug 2024
End-to-end Semantic-centric Video-based Multimodal Affective Computing
Ronghao Lin
Ying Zeng
Sijie Mai
Haifeng Hu
VGen
118
0
0
14 Aug 2024
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation
Yubin Cho
Hyunwoo Yu
Suk-Ju Kang
129
21
0
14 Aug 2024
ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Jingyun Wang
Guoliang Kang
VLM, SSL
103
7
0
13 Aug 2024
Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval
Rukai Wei
Heng Cui
Yu Liu
Yufeng Hou
Yanzhao Xie
Ke Zhou
3DPC
47
0
0
11 Aug 2024
Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models
Haonan Zheng
Wen Jiang
Xinyang Deng
Wenrui Li
VLM, AAML
58
4
0
06 Aug 2024
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
Ruixiang Zhao
Jian Jia
Yan Li
Xuehan Bai
Quan Chen
Han Li
Peng Jiang
Xirong Li
78
0
0
06 Aug 2024
Multistain Pretraining for Slide Representation Learning in Pathology
Guillaume Jaume
Anurag J. Vaidya
Andrew Zhang
Andrew H. Song
Richard J. Chen
S. Sahai
Dandan Mo
Emilio Madrigal
L. Le
Faisal Mahmood
109
14
0
05 Aug 2024
From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification
Fanzhi Jiang
Su Yang
Mark W. Jones
Liumei Zhang
102
1
0
31 Jul 2024
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
Kuo Wang
Lechao Cheng
Weikai Chen
Pingping Zhang
Liang Lin
Fan Zhou
Guanbin Li
VLM, ObjD
76
3
0
31 Jul 2024
Image-text matching for large-scale book collections
Artemis LLabres
Arka Ujjal Dey
Dimosthenis Karatzas
Ernest Valveny
37
0
0
29 Jul 2024
WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting
Jingjing Wu
Zhengyao Fang
Pengyuan Lyu
Chengquan Zhang
Fanglin Chen
Guangming Lu
Wenjie Pei
144
3
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
99
0
0
28 Jul 2024
Data Processing Techniques for Modern Multimodal Models
Yinheng Li
Han Ding
Hang Chen
VLM
87
0
0
27 Jul 2024
Diffusion Models for Multi-Task Generative Modeling
Changyou Chen
Han Ding
Bunyamin Sisman
Yi Tian Xu
Ouye Xie
Benjamin Z. Yao
Son Dinh Tran
Belinda Zeng
DiffM
91
5
0
24 Jul 2024
Selective Vision-Language Subspace Projection for Few-shot CLIP
Xingyu Zhu
Beier Zhu
Yi Tan
Shuo Wang
Yanbin Hao
Haiqi Zhang
VLM
112
4
0
24 Jul 2024
LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies
Jia Shi
Gautam Gare
Jinjin Tian
Siqi Chai
Zhiqiu Lin
Arun Vasudevan
Di Feng
Francesco Ferroni
Shu Kong
VLM, OODD, OOD
100
6
0
22 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM, MLLM
112
29
0
22 Jul 2024