© 2025 ResearchTrend.AI, All rights reserved.

VisualBERT: A Simple and Performant Baseline for Vision and Language
arXiv:1908.03557

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,178 papers shown
Title
PosSAM: Panoptic Open-vocabulary Segment Anything
VS Vibashan
Shubhankar Borse
Hyojin Park
Debasmit Das
Vishal M. Patel
Munawar Hayat
Fatih Porikli
VLM
MLLM
43
6
0
14 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
43
187
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
44
12
0
14 Mar 2024
Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Dong Shu
Zhouyao Zhu
37
1
0
14 Mar 2024
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
Long Lan
Fengxiang Wang
Shuyan Li
Xiangtao Zheng
Zengmao Wang
Xinwang Liu
VLM
31
8
0
13 Mar 2024
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat
Longju Bai
Joan Nwatu
Rada Mihalcea
39
6
0
12 Mar 2024
Noise-powered Multi-modal Knowledge Graph Representation Framework
Zhuo Chen
Yin Fang
Yichi Zhang
Lingbing Guo
Jiaoyan Chen
Hua-zeng Chen
Wen Zhang
28
0
0
11 Mar 2024
SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
Peng Qi
Zehong Yan
W. Hsu
M. Lee
MLLM
56
32
0
05 Mar 2024
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Iryna Hartsock
Ghulam Rasool
49
63
0
04 Mar 2024
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen
Zhuokai Zhao
Hongyin Luo
Huaxiu Yao
Bo Li
Jiawei Zhou
MLLM
46
57
0
01 Mar 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
34
2
0
27 Feb 2024
Vision Transformers with Natural Language Semantics
Young-Kyung Kim
Matías Di Martino
Guillermo Sapiro
ViT
23
5
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
40
3
0
27 Feb 2024
CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification
Haoran Lai
Qingsong Yao
Zihang Jiang
Rongsheng Wang
Zhiyang He
Xiaodong Tao
S. Kevin Zhou
MedIm
50
12
0
27 Feb 2024
ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking
Yushan Han
Kaer Huang
32
0
0
27 Feb 2024
How Can LLM Guide RL? A Value-Based Approach
Shenao Zhang
Sirui Zheng
Shuqi Ke
Zhihan Liu
Wanxin Jin
Jianbo Yuan
Yingxiang Yang
Hongxia Yang
Zhaoran Wang
35
8
0
25 Feb 2024
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Jiazhao Zhang
Kunyu Wang
Rongtao Xu
Gengze Zhou
Yicong Hong
Xiaomeng Fang
Qi Wu
Zhizheng Zhang
Wang He
LM&Ro
40
45
0
24 Feb 2024
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long
Xuri Ge
R. McCreadie
Joemon M. Jose
32
5
0
23 Feb 2024
Efficient data selection employing Semantic Similarity-based Graph Structures for model training
Roxana Petcu
Subhadeep Maji
28
0
0
22 Feb 2024
The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni
Federico Cocchi
Luca Barsellotti
Nicholas Moratelli
Sara Sarto
Lorenzo Baraldi
Marcella Cornia
Rita Cucchiara
LRM
VLM
59
41
0
19 Feb 2024
Strong hallucinations from negation and how to fix them
Nicholas Asher
Swarnadeep Bhar
ReLM
LRM
43
4
0
16 Feb 2024
MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding
Hai-Tao Yu
Mofei Song
3DPC
22
7
0
15 Feb 2024
Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection
E. Hossain
Omar Sharif
M. M. Hoque
S. Preum
24
3
0
15 Feb 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
26
13
0
15 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
40
5
0
14 Feb 2024
Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search
Yifei Yuan
Clemencia Siro
Mohammad Aliannejadi
Maarten de Rijke
Wai Lam
26
6
0
12 Feb 2024
RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation
Xiaohan Yu
Li Zhang
Xin Zhao
Yue Wang
Zhongrui Ma
53
10
0
07 Feb 2024
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
Jirayu Burapacheep
Ishan Gaur
Agam Bhatia
Tristan Thrush
40
4
0
07 Feb 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
50
1
0
31 Jan 2024
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijiao Zhang
Jindong Han
Zhao Xu
Hang Ni
Hao Liu
Hui Xiong
AI4CE
79
15
0
30 Jan 2024
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Ivana Beňová
Jana Kosecka
Michal Gregor
Martin Tamajka
Marcel Veselý
Marian Simko
30
1
0
29 Jan 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
44
0
0
29 Jan 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
40
5
0
29 Jan 2024
Memory-Inspired Temporal Prompt Interaction for Text-Image Classification
Xinyao Yu
Hao Sun
Ziwei Niu
Rui Qin
Zhenjia Bai
Yen-Wei Chen
Lanfen Lin
VLM
39
2
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
22
7
0
25 Jan 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Hongliang He
Wenlin Yao
Kaixin Ma
Wenhao Yu
Yong Dai
Hongming Zhang
Zhenzhong Lan
Dong Yu
LLMAG
40
121
0
25 Jan 2024
Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models
Hongzhan Lin
Ziyang Luo
Wei Gao
Jing Ma
Bo Wang
Ruichao Yang
34
13
0
24 Jan 2024
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Debjyoti Mondal
Suraj Modi
Subhadarshi Panda
Rituraj Singh
Godawari Sudhakar Rao
LRM
28
38
0
23 Jan 2024
Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection
Fatma Shalabi
Hichem Felouat
H. Nguyen
Isao Echizen
MLLM
33
3
0
22 Jan 2024
Multi-level Cross-modal Alignment for Image Clustering
Liping Qiu
Qin Zhang
Xiaojun Chen
Shao-Qian Cai
22
1
0
22 Jan 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
38
4
0
21 Jan 2024
MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts
Haoqiang Guo
Sendong Zhao
Hao Wang
Yanrui Du
Bing Qin
AI4CE
21
8
0
21 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
37
0
0
11 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
18
6
0
11 Jan 2024
VLP: Vision Language Planning for Autonomous Driving
Chenbin Pan
Burhaneddin Yaman
T. Nesti
Abhirup Mallik
A. Allievi
Senem Velipasalar
Liu Ren
VLM
27
56
0
10 Jan 2024
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA
Jiawei Chen
Dingkang Yang
Yue Jiang
Yuxuan Lei
Lihua Zhang
LM&MA
MedIm
13
13
0
10 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
21
2
0
09 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
97
8
0
03 Jan 2024
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
Xueling Zhu
Jian Liu
Dongqi Tang
Jiawei Ge
Weijia Liu
Bo Liu
Jiuxin Cao
VLM
27
1
0
02 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
49
4
0
28 Dec 2023