A Survey on Image-text Multimodal Models

23 September 2023
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Bihui Yu
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
    VLM

Papers citing "A Survey on Image-text Multimodal Models"

50 / 108 papers shown
Title
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
101
44
0
21 Sep 2021
Post-hoc Interpretability for Neural NLP: A Survey
Andreas Madsen
Siva Reddy
A. Chandar
XAI
64
231
0
10 Aug 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
74
140
0
29 Jul 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven C. H. Hoi
FaML
181
1,953
0
16 Jul 2021
From Show to Tell: A Survey on Deep Learning-based Image Captioning
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
S. Cascianelli
G. Fiameni
Rita Cucchiara
3DV
VLM
MLLM
109
269
0
14 Jul 2021
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval
Manh-Duy Nguyen
Binh T. Nguyen
C. Gurrin
30
16
0
04 Jun 2021
MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
Yash Khare
Viraj Bagal
Minesh Mathew
Adithi Devi
U. Priyakumar
C. V. Jawahar
MedIm
61
135
0
03 Apr 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
253
316
0
02 Mar 2021
SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering
Bo Liu
Li-Ming Zhan
Li Xu
Lin Ma
Y. Yang
Xiao-Ming Wu
70
262
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
426
1,127
0
17 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
443
3,839
0
11 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
114
1,741
0
05 Feb 2021
Similarity Reasoning and Filtration for Image-Text Matching
Haiwen Diao
Ying Zhang
Lingyun Ma
Huchuan Lu
280
335
0
05 Jan 2021
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
294
2,503
0
04 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
89
378
0
31 Dec 2020
A Survey on Visual Transformer
Kai Han
Yunhe Wang
Hanting Chen
Xinghao Chen
Jianyuan Guo
...
Chunjing Xu
Yixing Xu
Zhaohui Yang
Yiman Zhang
Dacheng Tao
ViT
190
2,223
0
23 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
77
144
0
08 Dec 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
211
5,068
0
08 Oct 2020
A Multimodal Memes Classification: A Survey and Open Research Issues
Tariq Habib Afridi
A. Alam
Muhammad Numan Khan
Jawad Khan
Young-Koo Lee
45
39
0
17 Sep 2020
A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises
S. Kevin Zhou
H. Greenspan
Christos Davatzikos
James S. Duncan
Bram van Ginneken
A. Madabhushi
Jerry L. Prince
Daniel Rueckert
Ronald M. Summers
152
638
0
02 Aug 2020
Adversarial Uni- and Multi-modal Stream Networks for Multimodal Image Registration
Zhe Xu
Jie Luo
Jiangpeng Yan
Ritvik Pulya
Xiu Li
W. Wells
J. Jagadeesan
MedIm
45
61
0
06 Jul 2020
Structured Multimodal Attentions for TextVQA
Chenyu Gao
Qi Zhu
Peng Wang
Hui Li
Yuliang Liu
Anton Van Den Hengel
Qi Wu
67
59
0
01 Jun 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT
3DV
PINN
377
13,025
0
26 May 2020
3D Deep Learning on Medical Images: A Review
S. Singh
Lipo Wang
Sukrit Gupta
Haveesh Goli
P. Padmanabhan
Balázs Gulyás
MedIm
74
426
0
01 Apr 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM
VLM
70
75
0
03 Mar 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
78
261
0
22 Jan 2020
CHAOS Challenge -- Combined (CT-MR) Healthy Abdominal Organ Segmentation
A. Emre Kavur
N. Gezer
M. Baris
Sinem Aslan
Pierre-Henri Conze
...
Klaus H. Maier-Hein
G. Akar
Gözde B. Ünal
O. Dicle
M. Alper Selver
80
621
0
17 Jan 2020
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
345
939
0
24 Sep 2019
Embracing Imperfect Datasets: A Review of Deep Learning Solutions for Medical Image Segmentation
Nima Tajbakhsh
Laura Jeyaseelan
Q. Li
J. Chiang
Zhihao Wu
Xiaowei Ding
145
762
0
27 Aug 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
151
1,663
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Tan
Mohit Bansal
VLM
MLLM
235
2,477
0
20 Aug 2019
Attention on Attention for Image Captioning
Lun Huang
Wenmin Wang
Jie Chen
Xiao-Yong Wei
59
832
0
19 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
136
1,951
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
221
3,678
0
06 Aug 2019
Learning by Abstraction: The Neural State Machine
Drew A. Hudson
Christopher D. Manning
NAI
OCL
65
260
0
09 Jul 2019
Towards VQA Models That Can Read
Amanpreet Singh
Vivek Natarajan
Meet Shah
Yu Jiang
Xinlei Chen
Dhruv Batra
Devi Parikh
Marcus Rohrbach
EgoV
77
1,216
0
18 Apr 2019
An Attentive Survey of Attention Models
S. Chaudhari
Varun Mithal
Gungor Polatkan
R. Ramanath
124
657
0
05 Apr 2019
A large annotated medical image dataset for the development and evaluation of segmentation algorithms
Amber L. Simpson
Michela Antonelli
Spyridon Bakas
Michel Bilello
Keyvan Farahani
...
M. McHugo
S. Napel
Eugene Vorontsov
Lena Maier-Hein
M. Jorge Cardoso
111
859
0
25 Feb 2019
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
Alistair E. W. Johnson
Tom Pollard
Nathaniel R. Greenbaum
M. Lungren
Chih-ying Deng
Yifan Peng
Zhiyong Lu
R. Mark
Seth Berkowitz
Steven Horng
MedIm
94
810
0
21 Jan 2019
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
Jeremy Irvin
Pranav Rajpurkar
M. Ko
Yifan Yu
Silviana Ciurea-Ilcus
...
D. Larson
C. Langlotz
Bhavik Patel
M. Lungren
A. Ng
110
2,591
0
21 Jan 2019
CCNet: Criss-Cross Attention for Semantic Segmentation
Zilong Huang
Xinggang Wang
Yunchao Wei
Lichao Huang
Humphrey Shi
Wenyu Liu
Chang Huang
VOS
210
2,544
0
28 Nov 2018
Medical Image Synthesis for Data Augmentation and Anonymization using Generative Adversarial Networks
Hoo-Chang Shin
Neil A. Tenenholtz
Jameson K. Rogers
C. Schwarz
M. Senjem
J. Gunter
Katherine P. Andriole
Mark H. Michalski
MedIm
99
540
0
26 Jul 2018
Attention Models in Graphs: A Survey
J. B. Lee
Ryan A. Rossi
Sungchul Kim
Nesreen K. Ahmed
Eunyee Koh
GNN
58
164
0
20 Jul 2018
Stacked Cross Attention for Image-Text Matching
Kuang-Huei Lee
Xi Chen
G. Hua
Houdong Hu
Xiaodong He
74
1,151
0
21 Mar 2018
MaskGAN: Better Text Generation via Filling in the______
W. Fedus
Ian Goodfellow
Andrew M. Dai
79
470
0
23 Jan 2018
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
119
4,214
0
25 Jul 2017
Multimodal Machine Learning: A Survey and Taxonomy
T. Baltrušaitis
Chaitanya Ahuja
Louis-Philippe Morency
80
2,928
0
26 May 2017
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Vijay Badrinarayanan
Alex Kendall
R. Cipolla
SSeg
1.1K
15,798
0
02 Nov 2015
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Julia Hockenmaier
Svetlana Lazebnik
193
2,053
0
19 May 2015
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen
Hao Fang
Tsung-Yi Lin
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
209
2,475
0
01 Apr 2015