ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.03557
  4. Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language

VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM
ArXiv (abs)PDFHTML

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,200 papers shown
Title
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
Jirayu Burapacheep
Ishan Gaur
Agam Bhatia
Tristan Thrush
57
5
0
07 Feb 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models
  for Spatial Proximity Analysis
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
56
2
0
31 Jan 2024
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijiao Zhang
Jindong Han
Zhao Xu
Hang Ni
Hao Liu
Hui Xiong
Hui Xiong
AI4CE
248
18
0
30 Jan 2024
Beyond Image-Text Matching: Verb Understanding in Multimodal
  Transformers Using Guided Masking
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Ivana Beňová
Jana Kosecka
Michal Gregor
Martin Tamajka
Marcel Veselý
Marian Simko
61
1
0
29 Jan 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
80
0
0
29 Jan 2024
Image-Text Out-Of-Context Detection Using Synthetic Multimodal
  Misinformation
Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation
Fatma Shalabi
H. Nguyen
Hichem Felouat
Ching-Chun Chang
Isao Echizen
102
5
0
29 Jan 2024
Memory-Inspired Temporal Prompt Interaction for Text-Image
  Classification
Memory-Inspired Temporal Prompt Interaction for Text-Image Classification
Xinyao Yu
Hao Sun
Ziwei Niu
Rui Qin
Zhenjia Bai
Yen-Wei Chen
Lanfen Lin
VLM
90
2
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other
  Modalities
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
137
7
0
25 Jan 2024
WebVoyager: Building an End-to-End Web Agent with Large Multimodal
  Models
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Hongliang He
Wenlin Yao
Kaixin Ma
Wenhao Yu
Yong Dai
Hongming Zhang
Zhenzhong Lan
Dong Yu
LLMAG
172
151
0
25 Jan 2024
Towards Explainable Harmful Meme Detection through Multimodal Debate
  between Large Language Models
Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models
Hongzhan Lin
Ziyang Luo
Wei Gao
Jing Ma
Bo Wang
Ruichao Yang
66
16
0
24 Jan 2024
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Debjyoti Mondal
Suraj Modi
Subhadarshi Panda
Rituraj Singh
Godawari Sudhakar Rao
LRM
77
46
0
23 Jan 2024
Leveraging Chat-Based Large Vision Language Models for Multimodal
  Out-Of-Context Detection
Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection
Fatma Shalabi
Hichem Felouat
H. Nguyen
Isao Echizen
MLLM
60
4
0
22 Jan 2024
Multi-level Cross-modal Alignment for Image Clustering
Multi-level Cross-modal Alignment for Image Clustering
Liping Qiu
Qin Zhang
Xiaojun Chen
Shao-Qian Cai
40
1
0
22 Jan 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
71
4
0
21 Jan 2024
MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks
  via Text Prompts
MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts
Haoqiang Guo
Sendong Zhao
Hao Wang
Yanrui Du
Bing Qin
AI4CE
77
8
0
21 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image
  Patch Selection
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
79
0
0
11 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention
  Network for Crisis Event Classification
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
61
7
0
11 Jan 2024
VLP: Vision Language Planning for Autonomous Driving
VLP: Vision Language Planning for Autonomous Driving
Chenbin Pan
Burhaneddin Yaman
T. Nesti
Abhirup Mallik
A. Allievi
Senem Velipasalar
Liu Ren
VLM
114
67
0
10 Jan 2024
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA
Jiawei Chen
Dingkang Yang
Yue Jiang
Yuxuan Lei
Lihua Zhang
LM&MAMedIm
62
15
0
10 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual
  Concept Understanding
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
56
2
0
09 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question
  Answering
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
163
8
0
03 Jan 2024
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label
  Classification
Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
Xueling Zhu
Jian Liu
Dongqi Tang
Jiawei Ge
Weijia Liu
Bo Liu
Jiuxin Cao
VLM
61
1
0
02 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGeVLM
298
3
0
28 Dec 2023
GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal
  Machine Learning Combining Facial Images and Clinical Texts
GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts
Da Wu
Jing Yang
Cong Liu
Tzung-Chien Hsieh
E. Marchi
...
Wendy K. Chung
G. Lyon
Ian D. Krantz
J. Kalish
Kai Wang
56
2
0
23 Dec 2023
Generative AI and the History of Architecture
Generative AI and the History of Architecture
J. Ploennigs
Markus Berger
83
1
0
22 Dec 2023
Towards a Unified Multimodal Reasoning Framework
Towards a Unified Multimodal Reasoning Framework
Abhinav Arun
Dipendra Singh Mal
Mehul Soni
Tomohiro Sawada
LRM
37
0
0
22 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in
  Language-Image Pretraining
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
94
8
0
19 Dec 2023
Expediting Contrastive Language-Image Pretraining via Self-distilled
  Encoders
Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
Bumsoo Kim
Jinhyung Kim
Yeonsik Jo
S. Kim
VLM
98
4
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLMMLLM
167
36
0
19 Dec 2023
UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic
  Cross-modal Learnable Prompts
UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts
Chenlu Zhan
Yufei Zhang
Yu Lin
Gaoang Wang
Hongwei Wang
VLMMedIm
87
5
0
18 Dec 2023
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language
  Fusion
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Tianlin Li
Jiandong Jin
Chenglong Li
Jin Tang
Cheng Zhang
Wei Wang
VLM
70
16
0
17 Dec 2023
Advancing Surgical VQA with Scene Graph Knowledge
Advancing Surgical VQA with Scene Graph Knowledge
Kun Yuan
Manasi Kattel
Joël L. Lavanchy
Nassir Navab
V. Srivastav
N. Padoy
124
21
0
15 Dec 2023
SMILE: Multimodal Dataset for Understanding Laughter in Video with
  Language Models
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
Lee Hyun
Kim Sung-Bin
Seungju Han
Youngjae Yu
Tae-Hyun Oh
100
15
0
15 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language
  Pre-training
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIPVLM
57
4
0
14 Dec 2023
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in
  Language Models
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Liqi He
Zuchao Li
Xiantao Cai
Ping Wang
LRM
83
25
0
14 Dec 2023
EZ-CLIP: Efficient Zeroshot Video Action Recognition
EZ-CLIP: Efficient Zeroshot Video Action Recognition
Shahzad Ahmad
S. Chanda
Yogesh S Rawat
VLM
76
7
0
13 Dec 2023
Multimodal Pretraining of Medical Time Series and Notes
Multimodal Pretraining of Medical Time Series and Notes
Ryan N. King
Tianbao Yang
Bobak J. Mortazavi
59
14
0
11 Dec 2023
Medical Vision Language Pretraining: A survey
Medical Vision Language Pretraining: A survey
Prashant Shrestha
Sanskar Amgain
Bidur Khanal
Cristian A. Linte
Binod Bhattarai
VLM
94
17
0
11 Dec 2023
MATK: The Meme Analytical Tool Kit
MATK: The Meme Analytical Tool Kit
Ming Shan Hee
Aditi Kumaresan
N. Hoang
Nirmalendu Prakash
Rui Cao
Roy Ka-wei Lee
VLM
52
2
0
11 Dec 2023
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning
  Distilled from Large Language Models
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models
Hongzhan Lin
Ziyang Luo
Jing Ma
Long Chen
60
12
0
09 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLMObjDLRMFAtt
93
12
0
07 Dec 2023
Adventures of Trustworthy Vision-Language Models: A Survey
Adventures of Trustworthy Vision-Language Models: A Survey
Mayank Vatsa
Anubhooti Jain
Richa Singh
97
4
0
07 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning
  into Vision-Language Models
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLMLRMMLLM
126
37
0
05 Dec 2023
Training on Synthetic Data Beats Real Data in Multimodal Relation
  Extraction
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
91
1
0
05 Dec 2023
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
  Grounding with Multimodal Large Language Model
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model
Guozhang Li
Xinpeng Ding
De Cheng
Jie Li
Nannan Wang
Xinbo Gao
100
1
0
05 Dec 2023
Recursive Visual Programming
Recursive Visual Programming
Jiaxin Ge
Sanjay Subramanian
Baifeng Shi
Roei Herzig
Trevor Darrell
46
7
0
04 Dec 2023
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large
  Language Models
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
Bingshuai Liu
Chenyang Lyu
Zijun Min
Zhanyu Wang
Jinsong Su
Longyue Wang
LRM
96
8
0
04 Dec 2023
Expand BERT Representation with Visual Information via Grounded Language
  Learning with Multimodal Partial Alignment
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
Cong-Duy Nguyen
The-Anh Vu-Le
Thong Nguyen
Tho Quan
Anh Tuan Luu
100
6
0
04 Dec 2023
Effectively Fine-tune to Improve Large Multimodal Models for Radiology
  Report Generation
Effectively Fine-tune to Improve Large Multimodal Models for Radiology Report Generation
Yuzhe Lu
Sungmin Hong
Yash Shah
Panpan Xu
LM&MAMedIm
64
7
0
03 Dec 2023
Grounding Everything: Emerging Localization Properties in
  Vision-Language Transformers
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham
Felix Petersen
Vittorio Ferrari
Hilde Kuehne
ObjDVLM
121
49
0
01 Dec 2023
Previous
123...567...222324
Next