ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2309.15857
  4. Cited By
A Survey on Image-text Multimodal Models

A Survey on Image-text Multimodal Models

23 September 2023
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
    VLM
ArXivPDFHTML

Papers citing "A Survey on Image-text Multimodal Models"

50 / 108 papers shown
Title
A kinetic-based regularization method for data science applications
Abhisek Ganguly
Alessandro Gabbana
Vybhav Rao
Sauro Succi
Santosh Ansumali
106
1
0
06 Mar 2025
AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for
  Transformers
AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers
Reduan Achtibat
Sayed Mohammad Vakilzadeh Hatefi
Maximilian Dreyer
Aakriti Jain
Thomas Wiegand
Sebastian Lapuschkin
Wojciech Samek
54
35
0
08 Feb 2024
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Xiangxiang Chu
Limeng Qiao
Xinyu Zhang
Shuang Xu
Fei Wei
...
Xiaofei Sun
Yiming Hu
Xinyang Lin
Bo Zhang
Chunhua Shen
VLM
MLLM
65
105
0
06 Feb 2024
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts
Ziheng Zhao
Yao Zhang
Chaoyi Wu
Xiaoman Zhang
Ya Zhang
Yanfeng Wang
Weidi Xie
VLM
MedIm
86
39
0
28 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and
  Comprehension via Semantic-aware Visual Objects
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM
VLM
61
16
0
08 Dec 2023
A Survey on Multimodal Large Language Models for Autonomous Driving
A Survey on Multimodal Large Language Models for Autonomous Driving
Can Cui
Yunsheng Ma
Xu Cao
Wenqian Ye
Yang Zhou
...
Xinrui Yan
Shuqi Mei
Jianguo Cao
Ziran Wang
Chao Zheng
103
281
0
21 Nov 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
...
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
78
119
0
09 Nov 2023
BiomedJourney: Counterfactual Biomedical Image Generation by
  Instruction-Learning from Multimodal Patient Journeys
BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys
Yu Gu
Jianwei Yang
Naoto Usuyama
Chun-yue Li
Sheng Zhang
M. Lungren
Jianfeng Gao
Hoifung Poon
MedIm
60
24
0
16 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
  Models
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
51
14
0
04 Oct 2023
Masked Vision and Language Pre-training with Unimodal and Multimodal
  Contrastive Losses for Medical Visual Question Answering
Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
Pengfei Li
Gang Liu
Jinlong He
Zixu Zhao
Shenjun Zhong
26
37
0
11 Jul 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited
  Modalities
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
80
122
0
18 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
78
84
0
17 May 2023
Graph Neural Networks in Vision-Language Image Understanding: A Survey
Graph Neural Networks in Vision-Language Image Understanding: A Survey
Henry Senior
Greg Slabaugh
Shanxin Yuan
Luca Rossi
GNN
45
17
0
07 Mar 2023
From colouring-in to pointillism: revisiting semantic segmentation
  supervision
From colouring-in to pointillism: revisiting semantic segmentation supervision
Rodrigo Benenson
V. Ferrari
VLM
47
20
0
25 Oct 2022
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Zhe Gan
Linjie Li
Chunyuan Li
Lijuan Wang
Zicheng Liu
Jianfeng Gao
VLM
51
164
0
17 Oct 2022
LAION-5B: An open large-scale dataset for training next generation
  image-text models
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann
Romain Beaumont
Richard Vencu
Cade Gordon
Ross Wightman
...
Srivatsa Kundurthy
Katherine Crowson
Ludwig Schmidt
R. Kaczmarczyk
J. Jitsev
VLM
MLLM
CLIP
150
3,444
0
16 Oct 2022
GLM-130B: An Open Bilingual Pre-trained Model
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
344
1,091
0
05 Oct 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
133
639
0
22 Aug 2022
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for
  Image-Text Retrieval
CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
Haoran Wang
Dongliang He
Wenhao Wu
Boyang Xia
Min Yang
Fu Li
YunLong Yu
Zhong Ji
Errui Ding
Jingdong Wang
45
23
0
21 Aug 2022
Reasoning about Actions over Visual and Linguistic Modalities: A Survey
Reasoning about Actions over Visual and Linguistic Modalities: A Survey
Shailaja Keyur Sampat
Maitreya Patel
Subhasish Das
Yezhou Yang
Chitta Baral
ReLM
LM&Ro
LRM
47
12
0
15 Jul 2022
A survey of multimodal deep generative models
A survey of multimodal deep generative models
Masahiro Suzuki
Y. Matsuo
SyDa
DRL
65
78
0
05 Jul 2022
LViT: Language meets Vision Transformer in Medical Image Segmentation
LViT: Language meets Vision Transformer in Medical Image Segmentation
Zihan Li
Yunxiang Li
Qingde Li
Puyang Wang
Dazhou Guo
Le Lu
D. Jin
You Zhang
Qingqi Hong
VLM
MedIm
88
138
0
29 Jun 2022
Surgical-VQA: Visual Question Answering in Surgical Scenes using
  Transformer
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer
Lalithkumar Seenivasan
Mobarakol Islam
Adithya K. Krishna
Hongliang Ren
MedIm
36
47
0
22 Jun 2022
Multimodal Learning with Transformers: A Survey
Multimodal Learning with Transformers: A Survey
Peng Xu
Xiatian Zhu
David Clifton
ViT
125
555
0
13 Jun 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
125
549
0
27 May 2022
i-Code: An Integrative and Composable Multimodal Learning Framework
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
DongDong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
53
48
0
03 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
364
3,535
0
29 Apr 2022
Vision-and-Language Pretrained Models: A Survey
Vision-and-Language Pretrained Models: A Survey
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
61
65
0
15 Apr 2022
Are Multimodal Transformers Robust to Missing Modality?
Are Multimodal Transformers Robust to Missing Modality?
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
90
153
0
12 Apr 2022
PaLM: Scaling Language Modeling with Pathways
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
464
6,231
0
05 Apr 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
33
21
0
17 Mar 2022
Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Ran Chen
Hanli Wang
Lei Wang
Sam Kwong
35
9
0
10 Mar 2022
LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network
  using Transformers for Cross-Modal Information Retrieval in Histopathology
  Archives
LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives
Danial Maleki
H. R Tizhoosh
MedIm
56
12
0
02 Mar 2022
Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion
  Recognition?
Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?
Vandana Rajan
Alessio Brutti
Andrea Cavallaro
63
35
0
18 Feb 2022
A Survey of Vision-Language Pre-Trained Models
A Survey of Vision-Language Pre-Trained Models
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
97
183
0
18 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple
  Sequence-to-Sequence Learning Framework
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
141
873
0
07 Feb 2022
Deep Learning Approaches on Image Captioning: A Review
Deep Learning Approaches on Image Captioning: A Review
Taraneh Ghandi
H. Pourreza
H. Mahyar
VLM
73
98
0
31 Jan 2022
Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's
  Progressive Matrices
Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices
Mikolaj Malkiñski
Jacek Mańdziuk
157
43
0
28 Jan 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified
  Vision-Language Understanding and Generation
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
513
4,340
0
28 Jan 2022
From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic
  Review on Evaluating Explainable AI
From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI
Meike Nauta
Jan Trienes
Shreyasi Pathak
Elisa Nguyen
Michelle Peters
Yasmin Schmitt
Jorg Schlotterer
M. V. Keulen
C. Seifert
ELM
XAI
86
409
0
20 Jan 2022
FLAVA: A Foundational Language And Vision Alignment Model
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
88
706
0
08 Dec 2021
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic
  Arithmetic
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
Yoad Tewel
Yoav Shalev
Idan Schwartz
Lior Wolf
VLM
56
195
0
29 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
122
249
0
24 Nov 2021
RedCaps: web-curated image-text data created by the people, for the
  people
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
101
166
0
22 Nov 2021
Attention Mechanisms in Computer Vision: A Survey
Attention Mechanisms in Computer Vision: A Survey
Meng-Hao Guo
Tianhan Xu
Jiangjiang Liu
Zheng-Ning Liu
Peng-Tao Jiang
Tai-Jiang Mu
Song-Hai Zhang
Ralph Robert Martin
Ming-Ming Cheng
Shimin Hu
100
1,689
0
15 Nov 2021
A Survey of Visual Transformers
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGS
ViT
130
350
0
11 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
92
634
0
09 Nov 2021
VLMo: Unified Vision-Language Pre-Training with
  Mixture-of-Modality-Experts
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM
MLLM
MoE
69
553
0
03 Nov 2021
Recent Advances in Natural Language Processing via Large Pre-Trained
  Language Models: A Survey
Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey
Bonan Min
Hayley L Ross
Elior Sulem
Amir Pouran Ben Veyseh
Thien Huu Nguyen
Oscar Sainz
Eneko Agirre
Ilana Heinz
Dan Roth
LM&MA
VLM
AI4CE
119
1,072
0
01 Nov 2021
Transparency of Deep Neural Networks for Medical Image Analysis: A
  Review of Interpretability Methods
Transparency of Deep Neural Networks for Medical Image Analysis: A Review of Interpretability Methods
Zohaib Salahuddin
Henry C. Woodruff
A. Chatterjee
Philippe Lambin
56
316
0
01 Nov 2021
123
Next