Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.03557
Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 1,178 papers shown
Title
PosSAM: Panoptic Open-vocabulary Segment Anything
VS Vibashan
Shubhankar Borse
Hyojin Park
Debasmit Das
Vishal M. Patel
Munawar Hayat
Fatih Porikli
VLM
MLLM
43
6
0
14 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
43
187
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
44
12
0
14 Mar 2024
Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Dong Shu
Zhouyao Zhu
37
1
0
14 Mar 2024
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
Long Lan
Fengxiang Wang
Shuyan Li
Xiangtao Zheng
Zengmao Wang
Xinwang Liu
VLM
31
8
0
13 Mar 2024
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat
Longju Bai
Joan Nwatu
Rada Mihalcea
39
6
0
12 Mar 2024
Noise-powered Multi-modal Knowledge Graph Representation Framework
Zhuo Chen
Yin Fang
Yichi Zhang
Lingbing Guo
Jiaoyan Chen
Hua-zeng Chen
Wen Zhang
Wen Zhang
28
0
0
11 Mar 2024
SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
Peng Qi
Zehong Yan
W. Hsu
M. Lee
MLLM
56
32
0
05 Mar 2024
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Iryna Hartsock
Ghulam Rasool
49
63
0
04 Mar 2024
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen
Zhuokai Zhao
Hongyin Luo
Huaxiu Yao
Bo Li
Jiawei Zhou
MLLM
46
57
0
01 Mar 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
34
2
0
27 Feb 2024
Vision Transformers with Natural Language Semantics
Young-Kyung Kim
Matías Di Martino
Guillermo Sapiro
ViT
23
5
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
40
3
0
27 Feb 2024
CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification
Haoran Lai
Qingsong Yao
Zihang Jiang
Rongsheng Wang
Zhiyang He
Xiaodong Tao
S. Kevin Zhou
MedIm
50
12
0
27 Feb 2024
ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking
Yushan Han
Kaer Huang
32
0
0
27 Feb 2024
How Can LLM Guide RL? A Value-Based Approach
Shenao Zhang
Sirui Zheng
Shuqi Ke
Zhihan Liu
Wanxin Jin
Jianbo Yuan
Yingxiang Yang
Hongxia Yang
Zhaoran Wang
35
8
0
25 Feb 2024
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Jiazhao Zhang
Kunyu Wang
Rongtao Xu
Gengze Zhou
Yicong Hong
Xiaomeng Fang
Qi Wu
Zhizheng Zhang
Wang He
LM&Ro
40
45
0
24 Feb 2024
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long
Xuri Ge
R. McCreadie
Joemon M. Jose
32
5
0
23 Feb 2024
Efficient data selection employing Semantic Similarity-based Graph Structures for model training
Roxana Petcu
Subhadeep Maji
28
0
0
22 Feb 2024
The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni
Federico Cocchi
Luca Barsellotti
Nicholas Moratelli
Sara Sarto
Lorenzo Baraldi
Lorenzo Baraldi
Marcella Cornia
Rita Cucchiara
LRM
VLM
59
41
0
19 Feb 2024
Strong hallucinations from negation and how to fix them
Nicholas Asher
Swarnadeep Bhar
ReLM
LRM
43
4
0
16 Feb 2024
MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding
Hai-Tao Yu
Mofei Song
3DPC
22
7
0
15 Feb 2024
Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection
E. Hossain
Omar Sharif
M. M. Hoque
S. Preum
24
3
0
15 Feb 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
26
13
0
15 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
40
5
0
14 Feb 2024
Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search
Yifei Yuan
Clemencia Siro
Mohammad Aliannejadi
Maarten de Rijke
Wai Lam
26
6
0
12 Feb 2024
RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation
Xiaohan Yu
Li Zhang
Xin Zhao
Yue Wang
Zhongrui Ma
53
10
0
07 Feb 2024
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
Jirayu Burapacheep
Ishan Gaur
Agam Bhatia
Tristan Thrush
40
4
0
07 Feb 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
50
1
0
31 Jan 2024
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijiao Zhang
Jindong Han
Zhao Xu
Hang Ni
Hao Liu
Hui Xiong
Hui Xiong
AI4CE
79
15
0
30 Jan 2024
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Ivana Beňová
Jana Kosecka
Michal Gregor
Martin Tamajka
Marcel Veselý
Marian Simko
30
1
0
29 Jan 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
44
0
0
29 Jan 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
40
5
0
29 Jan 2024
Memory-Inspired Temporal Prompt Interaction for Text-Image Classification
Xinyao Yu
Hao Sun
Ziwei Niu
Rui Qin
Zhenjia Bai
Yen-Wei Chen
Lanfen Lin
VLM
39
2
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
22
7
0
25 Jan 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Hongliang He
Wenlin Yao
Kaixin Ma
Wenhao Yu
Yong Dai
Hongming Zhang
Zhenzhong Lan
Dong Yu
LLMAG
40
121
0
25 Jan 2024
Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models
Hongzhan Lin
Ziyang Luo
Wei Gao
Jing Ma
Bo Wang
Ruichao Yang
34
13
0
24 Jan 2024
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Debjyoti Mondal
Suraj Modi
Subhadarshi Panda
Rituraj Singh
Godawari Sudhakar Rao
LRM
28
38
0
23 Jan 2024
Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection
Fatma Shalabi
Hichem Felouat
H. Nguyen
Isao Echizen
MLLM
33
3
0
22 Jan 2024
Multi-level Cross-modal Alignment for Image Clustering
Liping Qiu
Qin Zhang
Xiaojun Chen
Shao-Qian Cai
22
1
0
22 Jan 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
38
4
0
21 Jan 2024
MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts
Haoqiang Guo
Sendong Zhao
Hao Wang
Yanrui Du
Bing Qin
AI4CE
21
8
0
21 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
37
0
0
11 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
18
6
0
11 Jan 2024
VLP: Vision Language Planning for Autonomous Driving
Chenbin Pan
Burhaneddin Yaman
T. Nesti
Abhirup Mallik
A. Allievi
Senem Velipasalar
Liu Ren
VLM
27
56
0
10 Jan 2024
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA
Jiawei Chen
Dingkang Yang
Yue Jiang
Yuxuan Lei
Lihua Zhang
LM&MA
MedIm
13
13
0
10 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
21
2
0
09 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
97
8
0
03 Jan 2024
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
Xueling Zhu
Jian Liu
Dongqi Tang
Jiawei Ge
Weijia Liu
Bo Liu
Jiuxin Cao
VLM
27
1
0
02 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
49
4
0
28 Dec 2023
Previous
1
2
3
4
5
6
...
22
23
24
Next