ResearchTrend.AI
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
    FaML
ArXiv (abs) | PDF | HTML | GitHub (1658★)

Papers citing "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"

50 / 1,231 papers shown
Title
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
Kibum Kim
Kanghoon Yoon
Jaeyeong Jeon
Yeonjun In
Jinyoung Moon
Donghyun Kim
Chanyoung Park
149
18
0
16 Oct 2023
PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
Yangyang Guo
Guangzhi Wang
Mohan S. Kankanhalli
41
3
0
16 Oct 2023
Few-shot Action Recognition with Captioning Foundation Models
Xiang Wang
Shiwei Zhang
Hangjie Yuan
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
VLM
126
7
0
16 Oct 2023
Extending Multi-modal Contrastive Representations
Zehan Wang
Ziang Zhang
Luping Liu
Yang Zhao
Haifeng Huang
Tao Jin
Zhou Zhao
63
7
0
13 Oct 2023
Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning
Junyu Lu
Di Zhang
Xiaojun Wu
Xinyu Gao
Ruyi Gan
Jiaxing Zhang
Yan Song
Pingjian Zhang
VLM, MLLM
55
7
0
12 Oct 2023
Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning
Zhiming Qian
VLM, SSL
54
0
0
11 Oct 2023
IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training
Che Liu
Sibo Cheng
Miaojing Shi
Anand Shah
Wenjia Bai
Rossella Arcucci
94
27
0
11 Oct 2023
Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation
Xueye Zheng
Yunhao Luo
Pengyuan Zhou
Lin Wang
79
15
0
11 Oct 2023
The Solution for the CVPR2023 NICE Image Captioning Challenge
Xiangyu Wu
Yi Gao
Hailiang Zhang
Yang Yang
Weili Guo
Jianfeng Lu
58
1
0
10 Oct 2023
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia
Wenliang Dai
Samuel Cahyawijaya
Ziwei Ji
Pascale Fung
MLLM
105
53
0
09 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
69
0
0
07 Oct 2023
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin
Muchao Ye
Tianrong Zhang
Tianyu Du
Jinguo Zhu
Han Liu
Jinghui Chen
Ting Wang
Fenglong Ma
AAML, VLM, CoGe
89
44
0
07 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
104
10
0
06 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
72
12
0
05 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
92
14
0
04 Oct 2023
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection
Shilin Xu
Xiangtai Li
Size Wu
Wenwei Zhang
Yunhai Tong
Chen Change Loy
ObjD, VLM
60
0
0
02 Oct 2023
NEUCORE: Neural Concept Reasoning for Composed Image Retrieval
Shu Zhao
Huijuan Xu
55
6
0
02 Oct 2023
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Yiyang Zhou
Chenhang Cui
Jaehong Yoon
Linjun Zhang
Zhun Deng
Chelsea Finn
Mohit Bansal
Huaxiu Yao
MLLM
167
186
0
01 Oct 2023
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Tianyu Yu
Jinyi Hu
Yuan Yao
Haoye Zhang
Yue Zhao
...
Jiao Xue
Dahai Li
Zhiyuan Liu
Hai-Tao Zheng
Maosong Sun
VLM, MLLM
45
20
0
01 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM, MLLM
113
22
0
01 Oct 2023
Fewshot learning on global multimodal embeddings for earth observation tasks
Matt Allen
Francisco Dorr
Joseph A. Gallego-Mejia
Laura Martínez-Ferrer
Anna Jungbluth
F. Kalaitzis
Raúl Ramos-Pollán
77
9
0
29 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
69
3
0
28 Sep 2023
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search
Yuanmin Tang
Daling Wang
Keke Gai
Wenfang Wu
Yifei Zhang
Gang Xiong
Qi Wu
73
4
0
28 Sep 2023
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang
Jiahao Yu
Keke Gai
Jiamin Zhuang
Gang Xiong
Yue Hu
Qi Wu
83
39
0
28 Sep 2023
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
Hila Levi
Guy Heller
Dan Levi
Ethan Fetaya
OCL, VLM
69
4
0
26 Sep 2023
Robust Sequential DeepFake Detection
R. Shao
Tianxing Wu
Ziwei Liu
ViT, AAML
58
8
0
26 Sep 2023
BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning
Ching-Yu Chiang
I-Hua Chang
Shih-Wei Liao
83
1
0
26 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
63
10
0
26 Sep 2023
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Z. Yao
Xiaoxia Wu
Conglong Li
Minjia Zhang
Heyang Qi
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
93
11
0
25 Sep 2023
Detecting and Grounding Multi-Modal Media Manipulation and Beyond
Rui Shao
Tianxing Wu
Jianlong Wu
Liqiang Nie
Ziwei Liu
80
27
0
25 Sep 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
98
28
0
25 Sep 2023
Survey of Social Bias in Vision-Language Models
Nayeon Lee
Yejin Bang
Holy Lovenia
Samuel Cahyawijaya
Wenliang Dai
Pascale Fung
VLM
126
19
0
24 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
Kan Wu
Houwen Peng
Zhenghong Zhou
Bin Xiao
Mengchen Liu
...
Xi Chen
Xinggang Wang
Hongyang Chao
Han Hu
VLM, OODD
86
64
0
21 Sep 2023
BELT: Bootstrapping Electroencephalography-to-Language Decoding and Zero-Shot Sentiment Classification by Natural Language Supervision
Jinzhao Zhou
Yiqun Duan
Yu-Cheng Chang
Yu-Kai Wang
Chin-Teng Lin
74
6
0
21 Sep 2023
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
Renqiu Xia
Bo Zhang
Hao Peng
Hancheng Ye
Xiangchao Yan
Peng Ye
Botian Shi
Yu Qiao
Junchi Yan
116
0
0
20 Sep 2023
Image-Text Pre-Training for Logo Recognition
Mark Hubenthal
Suren Kumar
VLM
93
3
0
18 Sep 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Joey Tianyi Zhou
208
49
0
18 Sep 2023
Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation
Huan Liu
Zichang Tan
Qiang Chen
Yunchao Wei
Yao-Min Zhao
Jingdong Wang
58
9
0
18 Sep 2023
Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking
Wenzhang Wei
Zhipeng Gui
Changguang Wu
Anqi Zhao
D. Peng
Huayi Wu
77
0
0
15 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sánchez Villegas
Daniel Preoţiuc-Pietro
Nikolaos Aletras
64
3
0
14 Sep 2023
DePT: Decoupled Prompt Tuning
Ji Zhang
Shihan Wu
Lianli Gao
Hengtao Shen
Jingkuan Song
VLM
77
33
0
14 Sep 2023
TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification
M. Jehanzeb Mirza
Leonid Karlinsky
Wei Lin
Horst Possegger
Rogerio Feris
Horst Bischof
VLM
66
6
0
13 Sep 2023
Language Models as Black-Box Optimizers for Vision-Language Models
Shihong Liu
Zhiqiu Lin
Samuel Yu
Ryan Lee
Tiffany Ling
Deepak Pathak
Deva Ramanan
VLM
126
30
0
12 Sep 2023
Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals
Ran Liu
Ellen L. Zippi
Hadi Pouransari
Chris Sandino
Jingping Nie
Hanlin Goh
Erdrin Azemi
Ali Moin
96
12
0
12 Sep 2023
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
Yabing Wang
Shuhui Wang
Hao Luo
Jianfeng Dong
F. Wang
Meng Han
Xun Wang
Meng Wang
76
9
0
11 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
99
27
0
08 Sep 2023
Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment
Hongyu Hu
Tiancheng Lin
Jie Wang
Zhenbang Sun
Yi Xu
MLLM, VLM, VPVLM
43
1
0
08 Sep 2023
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis
Jiapeng Zhu
Ceyuan Yang
Kecheng Zheng
Yinghao Xu
Zifan Shi
Yujun Shen
MoE
97
8
0
07 Sep 2023
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Zigang Geng
Binxin Yang
Tiankai Hang
Chen Li
Shuyang Gu
...
Jianmin Bao
Zheng Zhang
Han Hu
DongDong Chen
Baining Guo
DiffM, VLM
118
107
0
07 Sep 2023