1908.03557
Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 1,200 papers shown
Title
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
Jirayu Burapacheep
Ishan Gaur
Agam Bhatia
Tristan Thrush
57
5
0
07 Feb 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
56
2
0
31 Jan 2024
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijiao Zhang
Jindong Han
Zhao Xu
Hang Ni
Hao Liu
Hui Xiong
AI4CE
248
18
0
30 Jan 2024
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Ivana Beňová
Jana Kosecka
Michal Gregor
Martin Tamajka
Marcel Veselý
Marian Simko
61
1
0
29 Jan 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
80
0
0
29 Jan 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
102
5
0
29 Jan 2024
Memory-Inspired Temporal Prompt Interaction for Text-Image Classification
Xinyao Yu
Hao Sun
Ziwei Niu
Rui Qin
Zhenjia Bai
Yen-Wei Chen
Lanfen Lin
VLM
90
2
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
137
7
0
25 Jan 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Hongliang He
Wenlin Yao
Kaixin Ma
Wenhao Yu
Yong Dai
Hongming Zhang
Zhenzhong Lan
Dong Yu
LLMAG
172
151
0
25 Jan 2024
Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models
Hongzhan Lin
Ziyang Luo
Wei Gao
Jing Ma
Bo Wang
Ruichao Yang
66
16
0
24 Jan 2024
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Debjyoti Mondal
Suraj Modi
Subhadarshi Panda
Rituraj Singh
Godawari Sudhakar Rao
LRM
77
46
0
23 Jan 2024
Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection
Fatma Shalabi
Hichem Felouat
H. Nguyen
Isao Echizen
MLLM
60
4
0
22 Jan 2024
Multi-level Cross-modal Alignment for Image Clustering
Liping Qiu
Qin Zhang
Xiaojun Chen
Shao-Qian Cai
40
1
0
22 Jan 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
71
4
0
21 Jan 2024
MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts
Haoqiang Guo
Sendong Zhao
Hao Wang
Yanrui Du
Bing Qin
AI4CE
77
8
0
21 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
79
0
0
11 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
61
7
0
11 Jan 2024
VLP: Vision Language Planning for Autonomous Driving
Chenbin Pan
Burhaneddin Yaman
T. Nesti
Abhirup Mallik
A. Allievi
Senem Velipasalar
Liu Ren
VLM
114
67
0
10 Jan 2024
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA
Jiawei Chen
Dingkang Yang
Yue Jiang
Yuxuan Lei
Lihua Zhang
LM&MA
MedIm
62
15
0
10 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
56
2
0
09 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
163
8
0
03 Jan 2024
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
Xueling Zhu
Jian Liu
Dongqi Tang
Jiawei Ge
Weijia Liu
Bo Liu
Jiuxin Cao
VLM
61
1
0
02 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
298
3
0
28 Dec 2023
GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts
Da Wu
Jing Yang
Cong Liu
Tzung-Chien Hsieh
E. Marchi
...
Wendy K. Chung
G. Lyon
Ian D. Krantz
J. Kalish
Kai Wang
56
2
0
23 Dec 2023
Generative AI and the History of Architecture
J. Ploennigs
Markus Berger
83
1
0
22 Dec 2023
Towards a Unified Multimodal Reasoning Framework
Abhinav Arun
Dipendra Singh Mal
Mehul Soni
Tomohiro Sawada
LRM
37
0
0
22 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
94
8
0
19 Dec 2023
Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
Bumsoo Kim
Jinhyung Kim
Yeonsik Jo
S. Kim
VLM
98
4
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
167
36
0
19 Dec 2023
UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts
Chenlu Zhan
Yufei Zhang
Yu Lin
Gaoang Wang
Hongwei Wang
VLM
MedIm
87
5
0
18 Dec 2023
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Tianlin Li
Jiandong Jin
Chenglong Li
Jin Tang
Cheng Zhang
Wei Wang
VLM
70
16
0
17 Dec 2023
Advancing Surgical VQA with Scene Graph Knowledge
Kun Yuan
Manasi Kattel
Joël L. Lavanchy
Nassir Navab
V. Srivastav
N. Padoy
124
21
0
15 Dec 2023
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
Lee Hyun
Kim Sung-Bin
Seungju Han
Youngjae Yu
Tae-Hyun Oh
100
15
0
15 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
57
4
0
14 Dec 2023
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Liqi He
Zuchao Li
Xiantao Cai
Ping Wang
LRM
83
25
0
14 Dec 2023
EZ-CLIP: Efficient Zeroshot Video Action Recognition
Shahzad Ahmad
S. Chanda
Yogesh S Rawat
VLM
76
7
0
13 Dec 2023
Multimodal Pretraining of Medical Time Series and Notes
Ryan N. King
Tianbao Yang
Bobak J. Mortazavi
59
14
0
11 Dec 2023
Medical Vision Language Pretraining: A survey
Prashant Shrestha
Sanskar Amgain
Bidur Khanal
Cristian A. Linte
Binod Bhattarai
VLM
94
17
0
11 Dec 2023
MATK: The Meme Analytical Tool Kit
Ming Shan Hee
Aditi Kumaresan
N. Hoang
Nirmalendu Prakash
Rui Cao
Roy Ka-wei Lee
VLM
52
2
0
11 Dec 2023
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models
Hongzhan Lin
Ziyang Luo
Jing Ma
Long Chen
60
12
0
09 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLM
ObjD
LRM
FAtt
93
12
0
07 Dec 2023
Adventures of Trustworthy Vision-Language Models: A Survey
Mayank Vatsa
Anubhooti Jain
Richa Singh
97
4
0
07 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLM
LRM
MLLM
126
37
0
05 Dec 2023
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
91
1
0
05 Dec 2023
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model
Guozhang Li
Xinpeng Ding
De Cheng
Jie Li
Nannan Wang
Xinbo Gao
100
1
0
05 Dec 2023
Recursive Visual Programming
Jiaxin Ge
Sanjay Subramanian
Baifeng Shi
Roei Herzig
Trevor Darrell
46
7
0
04 Dec 2023
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
Bingshuai Liu
Chenyang Lyu
Zijun Min
Zhanyu Wang
Jinsong Su
Longyue Wang
LRM
96
8
0
04 Dec 2023
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
Cong-Duy Nguyen
The-Anh Vu-Le
Thong Nguyen
Tho Quan
Anh Tuan Luu
100
6
0
04 Dec 2023
Effectively Fine-tune to Improve Large Multimodal Models for Radiology Report Generation
Yuzhe Lu
Sungmin Hong
Yash Shah
Panpan Xu
LM&MA
MedIm
64
7
0
03 Dec 2023
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham
Felix Petersen
Vittorio Ferrari
Hilde Kuehne
ObjD
VLM
121
49
0
01 Dec 2023