ResearchTrend.AI
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven Hoi
    FaML
arXiv: 2107.07651 | GitHub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
Title
Transformer-empowered Multi-modal Item Embedding for Enhanced Image Search in E-Commerce
Chang Liu
Peng Hou
Anxiang Zeng
Hanwen Yu
127
2
0
29 Nov 2023
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei
Yang Chen
Haonan Chen
Hexiang Hu
Ge Zhang
Jie Fu
Alan Ritter
Wenhu Chen
88
70
0
28 Nov 2023
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
Jiayun Luo
Siddhesh Khandelwal
Leonid Sigal
Boyang Albert Li
MLLM, VLM
136
8
0
28 Nov 2023
RISAM: Referring Image Segmentation via Mutual-Aware Attention Features
Mengxi Zhang
Yiming Liu
Xiangjun Yin
Huanjing Yue
Jingyu Yang
114
1
0
27 Nov 2023
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
Lingchen Meng
Shiyi Lan
Hengduo Li
Jose M. Alvarez
Zuxuan Wu
Yu-Gang Jiang
VLM, ISeg, MLLM
67
9
0
24 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
81
12
0
22 Nov 2023
Multimodal Large Language Models: A Survey
Jiayang Wu
Wensheng Gan
Zefeng Chen
Shicheng Wan
Philip S. Yu
95
195
0
22 Nov 2023
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Meng Chu
Zhedong Zheng
Wei Ji
Tingyu Wang
Tat-Seng Chua
91
10
0
21 Nov 2023
From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation
Jiaxin Ge
Sanjay Subramanian
Trevor Darrell
Boyi Li
LRM
104
4
0
21 Nov 2023
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
66
14
0
18 Nov 2023
MultiDelete for Multimodal Machine Unlearning
Jiali Cheng
Hadi Amiri
MU
111
9
0
18 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
131
72
0
16 Nov 2023
RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection
Stefanos-Iordanis Papadopoulos
C. Koutlis
Symeon Papadopoulos
P. Petrantonakis
79
10
0
16 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM, MLLM
371
711
0
16 Nov 2023
Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval
Junyang Chen
Hanjiang Lai
VLM
130
15
0
13 Nov 2023
Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples
Ziye Fang
Xin Jiang
Hao Tang
Zechao Li
92
14
0
10 Nov 2023
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter
Georgios Tziafas
Yucheng Xu
Arushi Goel
Mohammadreza Kasaei
Zhibin Li
Hamidreza Kasaei
89
27
0
09 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
141
21
0
09 Nov 2023
Zero-shot Translation of Attention Patterns in VQA Models to Natural Language
Leonard Salewski
A. Sophia Koepke
Hendrik P. A. Lensch
Zeynep Akata
71
2
0
08 Nov 2023
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Georgios Pantazopoulos
Malvina Nikandrou
Amit Parekh
Bhathiya Hemanthage
Arash Eshghi
Ioannis Konstas
Verena Rieser
Oliver Lemon
Alessandro Suglia
LM&Ro
77
7
0
07 Nov 2023
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
U. Sahin
Hang Li
Qadeer Ahmad Khan
Daniel Cremers
Volker Tresp
VLM, CoGe
74
14
0
07 Nov 2023
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
Cheng Cheng
Lin Song
Ruoyi Xue
Hang Wang
Hongbin Sun
Yixiao Ge
Ying Shan
VLM, ObjD
116
26
0
07 Nov 2023
Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers
Hai T. Phan
Cindy X. Le
Vu Le
Yihui He
Anh Totti Nguyen
63
3
0
06 Nov 2023
Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning
Yiran Li
Junpeng Wang
Prince Osei Aboagye
Michael Yeh
Yan Zheng
Liang Wang
Wei Zhang
Kwan-Liu Ma
75
3
0
02 Nov 2023
Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning
Zhenyu Zhang
Benlu Wang
Weijie Liang
Yizhi Li
Xuechen Guo
Guanhong Wang
Shiyan Li
Gaoang Wang
MedIm, LM&MA
34
9
0
02 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM, DiffM
103
11
0
01 Nov 2023
CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
A. Fuller
K. Millard
James R. Green
94
72
0
01 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
151
44
0
01 Nov 2023
Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data
Antonis Antoniades
Yiyi Yu
Joseph Canzano
William Wang
Spencer L. Smith
AI4CE
120
13
0
31 Oct 2023
Class Incremental Learning with Pre-trained Vision-Language Models
Xialei Liu
Xusheng Cao
Haori Lu
Jia-Wen Xiao
Andrew D. Bagdanov
Ming-Ming Cheng
VLM
87
12
0
31 Oct 2023
SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization
Hao Dong
Ismail Nejjar
Han Sun
Eleni Chatzi
Olga Fink
99
25
0
30 Oct 2023
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
Youbo Lei
Feifei He
Chen Chen
Yingbin Mo
Sijia Li
Defeng Xie
H. Lu
VLM
87
0
0
30 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP, VLM, VGen
106
2
0
30 Oct 2023
ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense
Kankan Zhou
Eason Lai
Wei Bin Au Yeong
K. Mouratidis
Jing Jiang
ReLM, LRM, VLM
67
20
0
30 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
97
34
0
29 Oct 2023
Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad
Reza Azad
Sania Eskandari
Afshin Bozorgpour
Amirhossein Kazerouni
I. Rekik
Dorit Merhof
VLM, MedIm
144
68
0
28 Oct 2023
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
Yuchen Suo
Linchao Zhu
Yi Yang
86
13
0
27 Oct 2023
SynergyNet: Bridging the Gap between Discrete and Continuous Representations for Precise Medical Image Segmentation
Vandan Gorade
Sparsh Mittal
Debesh Jha
Ulas Bagci
69
11
0
26 Oct 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
74
7
0
26 Oct 2023
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan
B. Vijaykumar
S. Schulter
Manmohan Chandraker
Yun Fu
ReLM
62
12
0
25 Oct 2023
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
81
2
0
24 Oct 2023
I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation
Yunyao Mao
Jiajun Deng
Wen-gang Zhou
Zhenbo Lu
Wanli Ouyang
Houqiang Li
VLM
85
1
0
24 Oct 2023
Learning with Noisy Labels Using Collaborative Sample Selection and Contrastive Semi-Supervised Learning
Qing Miao
Xiaohe Wu
Chao Xu
Yanli Ji
Wangmeng Zuo
Yiwen Guo
Zhaopeng Meng
NoLa
85
5
0
24 Oct 2023
Linking Surface Facts to Large-Scale Knowledge Graphs
Gorjan Radevski
Kiril Gashteovski
Chia-Chien Hung
Carolin (Haas) Lawrence
Goran Glavaš
HILM
58
3
0
23 Oct 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang
VLM
87
35
0
23 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
82
4
0
20 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
61
0
0
20 Oct 2023
SILC: Improving Vision Language Pretraining with Self-Distillation
Muhammad Ferjad Naeem
Yongqin Xian
Xiaohua Zhai
Lukas Hoyer
Luc Van Gool
F. Tombari
VLM
110
36
0
20 Oct 2023
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
Yimu Wang
Xiangru Jian
Bo Xue
55
11
0
17 Oct 2023
EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset
Hang Yin
Pinren Lu
Ziang Li
Bin Sun
Kan Li
96
0
0
17 Oct 2023