2204.07356
Vision-and-Language Pretrained Models: A Survey
15 April 2022
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
Papers citing
"Vision-and-Language Pretrained Models: A Survey"
47 / 47 papers shown
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Shuo Yang
Siwen Luo
S. Han
Eduard Hovy
LRM
39
0
0
24 Mar 2025
PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior
S. Lee
Hwanhee Jung
Byoungsoo Koh
Qixing Huang
Sangho Yoon
Sangpil Kim
49
0
0
17 Mar 2025
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
105
4
0
12 Feb 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
88
11
0
06 Jan 2025
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh
Hwanhee Jung
Jong Wook Kim
S. Lee
Innfarn Yoo
Andreas Lugmayr
Seunggeun Chi
K. Ramani
Sangpil Kim
3DGS
87
2
0
17 Dec 2024
Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
Soyeon Caren Han
Feiqi Cao
Josiah Poon
Roberto Navigli
MLLM
VLM
32
5
0
08 Oct 2024
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models
Harshit
Tolga Tasdizen
CoGe
VLM
28
1
0
06 Oct 2024
A Large-Scale Study of Model Integration in ML-Enabled Software Systems
Yorick Sens
Henriette Knopp
Sven Peldszus
Thorsten Berger
AIFin
31
2
0
12 Aug 2024
Segment Anything for Videos: A Systematic Survey
Chunhui Zhang
Yawen Cui
Weilin Lin
Guanjie Huang
Yan Rong
Li Liu
Shiguang Shan
VLM
44
6
0
31 Jul 2024
InFiConD: Interactive No-code Fine-tuning with Concept-based Knowledge Distillation
Jinbin Huang
Wenbin He
Liang Gou
Liu Ren
Chris Bryan
50
0
0
25 Jun 2024
Younger: The First Dataset for Artificial Intelligence-Generated Neural Network Architecture
Zhengxin Yang
Wanling Gao
Luzhou Peng
Yunyou Huang
Fei Tang
Jianfeng Zhan
33
0
0
20 Jun 2024
Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
Mingqian Feng
Yunlong Tang
Zeliang Zhang
Chenliang Xu
34
3
0
18 Jun 2024
Feature Distribution Shift Mitigation with Contrastive Pretraining for Intrusion Detection
Weixing Wang
Haojin Yang
Christoph Meinel
Hasan Yağiz Özkan
Cristian Bermudez Serna
C. M. Machuca
19
0
0
23 Apr 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
43
1
0
19 Apr 2024
RankCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zhili Feng
Zenghui Ding
Yining Sun
SSL
VLM
48
7
0
15 Apr 2024
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen
Zhuokai Zhao
Hongyin Luo
Huaxiu Yao
Bo Li
Jiawei Zhou
MLLM
46
57
0
01 Mar 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
22
5
0
08 Feb 2024
A Survey on Hallucination in Large Vision-Language Models
Hanchao Liu
Wenyuan Xue
Yifei Chen
Dapeng Chen
Xiutian Zhao
Ke Wang
Liping Hou
Rong-Zhi Li
Wei Peng
LRM
MLLM
29
113
0
01 Feb 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
52
179
0
24 Jan 2024
Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution
Ying Wang
Tim G. J. Rudner
Andrew Gordon Wilson
15
19
0
28 Dec 2023
DSAP: Analyzing Bias Through Demographic Comparison of Datasets
Iris Dominguez-Catena
D. Paternain
M. Galar
35
4
0
22 Dec 2023
Adventures of Trustworthy Vision-Language Models: A Survey
Mayank Vatsa
Anubhooti Jain
Richa Singh
22
4
0
07 Dec 2023
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
21
4
0
28 Nov 2023
Natural Language Interfaces for Tabular Data Querying and Visualization: A Survey
Weixu Zhang
Yifei Wang
Yuanfeng Song
Victor Junqiu Wei
Yuxing Tian
Yiyan Qi
Jonathan H. Chan
Raymond Chi-Wing Wong
Haiqin Yang
LMTD
46
15
0
27 Oct 2023
PSP: Pre-Training and Structure Prompt Tuning for Graph Neural Networks
Qingqing Ge
Zeyuan Zhao
Yiding Liu
Anfeng Cheng
Xiang Li
Shuaiqiang Wang
Dawei Yin
26
6
0
26 Oct 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai Le-Duc
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
31
5
0
23 Sep 2023
ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts
Bilel Benjdira
Anis Koubaa
Anas M. Ali
LM&Ro
22
3
0
22 Aug 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
38
118
0
25 Jul 2023
Prototypical Contrastive Transfer Learning for Multimodal Language Understanding
Seitaro Otsuki
Shintaro Ishikawa
K. Sugiura
43
1
0
12 Jul 2023
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
F. Liu
Delong Chen
Zhan-Rong Guan
Xiaocong Zhou
Jiale Zhu
Qiaolin Ye
Liyong Fu
Jun Zhou
VLM
68
191
0
19 Jun 2023
Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias
Zhongwei Wan
Che Liu
Mi Zhang
Jie Fu
Benyou Wang
Sibo Cheng
Lei Ma
César Quilodrán-Casas
Rossella Arcucci
44
71
0
31 May 2023
A Diffusion Model for Event Skeleton Generation
Fangqi Zhu
Lin Zhang
Junfeng Gao
Bing Qin
Ruifeng Xu
Haiqing Yang
DiffM
15
2
0
27 May 2023
ProgSG: Cross-Modality Representation Learning for Programs in Electronic Design Automation
Yunsheng Bai
Atefeh Sohrabizadeh
Zongyue Qin
Ziniu Hu
Yizhou Sun
Jason Cong
18
1
0
18 May 2023
A Comprehensive Survey on Segment Anything Model for Vision and Beyond
Chunhui Zhang
Li Liu
Yawen Cui
Guanjie Huang
Weilin Lin
Yiqian Yang
Yuehong Hu
VLM
34
90
0
14 May 2023
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Yunxin Li
Baotian Hu
Xinyu Chen
Yuxin Ding
Lin Ma
Min Zhang
LRM
48
14
0
08 May 2023
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Feilong Chen
Minglun Han
Haozhi Zhao
Qingyang Zhang
Jing Shi
Shuang Xu
Bo Xu
MLLM
36
115
0
07 May 2023
Multimodal Understanding Through Correlation Maximization and Minimization
Yi Shi
Marc Niethammer
33
0
0
04 May 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
31
202
0
20 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
23
26
0
01 Feb 2023
SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering
Feiqi Cao
Siwen Luo
F. Núñez
Zean Wen
Josiah Poon
Caren Han
GNN
23
4
0
16 Dec 2022
PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals
Zhihao Zhang
Siwen Luo
Junyi Chen
Sijia Lai
Siqu Long
Hyunsuk Chung
S. Han
17
1
0
29 Nov 2022
Universal Prompt Tuning for Graph Neural Networks
Taoran Fang
Yunchao Zhang
Yang Yang
Chunping Wang
Lei Chen
24
47
0
30 Sep 2022
VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models
Felix Vogel
Nina Shvetsova
Leonid Karlinsky
Hilde Kuehne
VLM
63
7
0
12 Sep 2022
Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis
Siwen Luo
Yi Ding
Siqu Long
Josiah Poon
S. Han
GNN
20
16
0
22 Aug 2022
Understanding Attention for Vision-and-Language Tasks
Feiqi Cao
S. Han
Siqu Long
Changwei Xu
Josiah Poon
34
5
0
17 Aug 2022
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
298
3,700
0
11 Feb 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
252
927
0
24 Sep 2019