VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM
arXiv: 1908.03557

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,200 papers shown
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
Zhangyang Qi
Zhixiong Zhang
Yizhou Yu
Jiaqi Wang
Hengshuang Zhao
LM&Ro, AI4TS
48
0
0
20 Jun 2025
Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao
Yiwei Wang
Yujun Cai
Zhicheng YANG
Jing Tang
LLMAG
15
0
0
18 Jun 2025
Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models
Xuelin Shen
Jiayin Xu
Kangsheng Yin
Wenhan Yang
AAML
19
0
0
18 Jun 2025
HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
Numair Nadeem
Saeed Anwar
Muhammad Asad
Abdul Bais
VLM
22
0
0
16 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE, VLM
30
0
0
13 Jun 2025
RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer
Haotian Ni
Yake Wei
Hang Liu
Gong Chen
Chong Peng
Hao Lin
Di Hu
OffRL
73
0
0
13 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
65
0
0
11 Jun 2025
Multimodal Representation Alignment for Cross-modal Information Retrieval
Fan Xu
Luis A. Leiva
17
0
0
10 Jun 2025
OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis
Jiewen Hu
Leena Mathur
Paul Pu Liang
Louis-Philippe Morency
CVBM
57
0
0
03 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
51
0
0
02 Jun 2025
What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning
Zhaotian Weng
Haoxuan Li
Kuan-Hao Huang
Jieyu Zhao
LRM, CoGe
32
0
0
01 Jun 2025
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Junyu Luo
Zhizhuo Kou
Liming Yang
Xiao Luo
Jinsheng Huang
...
Jiaming Ji
Xuanzhe Liu
Sirui Han
Ming Zhang
Yike Guo
20
0
0
30 May 2025
Multi-MLLM Knowledge Distillation for Out-of-Context News Detection
Yimeng Gu
Zhao Tong
Ignacio Castro
Shu Wu
Gareth Tyson
15
0
0
28 May 2025
LifeIR at the NTCIR-18 Lifelog-6 Task
Jiahan Chen
Da Li
Keping Bi
30
0
0
27 May 2025
Multi-modal brain encoding models for multi-modal stimuli
Subba Reddy Oota
Khushbu Pahwa
Mounika Marreddy
Maneesh Singh
Manish Gupta
Bapi S. Raju
25
1
0
26 May 2025
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
Matthew Lisondra
B. Benhabib
G. Nejat
LM&Ro
74
0
0
26 May 2025
Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
Md. Mithun Hossain
Md. Shakil Hossain
Sudipto Chaki
M. F. Mridha
255
0
0
25 May 2025
Visual Question Answering on Multiple Remote Sensing Image Modalities
Hichem Boussaid
Lucrezia Tosato
F. Weissgerber
Camille Kurtz
Laurent Wendling
Sylvain Lobry
64
0
0
21 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
98
0
0
20 May 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Lihong Chen
Hossein Hassani
Soodeh Nikan
VLM
104
0
0
19 May 2025
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui
Cong Ma
Zongming Ma
SSL
101
0
0
18 May 2025
Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models
Aryan Das
Tanishq Rachamalla
Pravendra Singh
Koushik Biswas
Vinay Kumar Verma
Swalpa Kumar Roy
VLM
79
0
0
18 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
74
0
0
16 May 2025
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
Pengfei Wang
Guohai Xu
Weinong Wang
Junjie Yang
Jie Lou
Yunhua Xue
99
0
0
15 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
79
0
0
13 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Aishwarya Venkataramanan
P. Bodesheim
Joachim Denzler
BDL, VLM
100
0
0
08 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
303
1
0
05 May 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
157
0
0
30 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
183
0
0
29 Apr 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
50
0
0
27 Apr 2025
Multimodal graph representation learning for website generation based on visual sketch
Tung D. Vu
Chung Hoang
Truong-Son Hy
3DV
103
0
0
25 Apr 2025
ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
Shuanglin Yan
Neng Dong
Shuang Li
Rui Yan
Hao Tang
Jing Qin
435
0
0
25 Apr 2025
A Genealogy of Multi-Sensor Foundation Models in Remote Sensing
Kevin Lane
Morteza Karimzadeh
81
0
0
24 Apr 2025
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
Ali Anaissi
Junaid Akram
Kunal Chaturvedi
Ali Braytee
58
0
0
23 Apr 2025
FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing
Hariseetharam Gunduboina
Muhammad Haris Khan
Biplab Banerjee
VLM
94
0
0
23 Apr 2025
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
Songtao Jiang
Yuan Wang
Sibo Song
Yanzhe Zhang
Zijie Meng
Bohan Lei
Jian Wu
Jimeng Sun
Zuozhu Liu
MedIm, VLM
95
3
0
20 Apr 2025
Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection
Xiwen Li
Ross T. Whitaker
Tolga Tasdizen
58
0
0
15 Apr 2025
TSAL: Few-shot Text Segmentation Based on Attribute Learning
Chenming Li
Chengxu Liu
Yuanting Fan
Xiao Jin
Xingsong Hou
Xueming Qian
VLM
88
0
0
15 Apr 2025
Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging
Siyuan Dai
Kai Ye
Guodong Liu
Haoteng Tang
Liang Zhan
MedIm
49
0
0
09 Apr 2025
DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion
Wei Huang
M. Liang
Peining Li
Xu Hou
Yawen Li
Junping Du
Zhe Xue
Zeli Guan
DiffM
75
0
0
09 Apr 2025
A Lightweight Large Vision-language Model for Multimodal Medical Images
Belal Alsinglawi
Chris McCarthy
Sara Webb
Christopher Fluke
Navid Toosy Saidy
LM&MA
88
0
0
08 Apr 2025
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Runnan Fang
Xiaobin Wang
Yuan Liang
Shuofei Qiao
Jialong Wu
...
N. Zhang
Yong Jiang
Pengjun Xie
Fei Huang
Hong Chen
LLMAG
153
0
0
04 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
95
0
0
02 Apr 2025
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Hongcheng Gao
Jiashu Qu
Jingyi Tang
Baolong Bi
Yi Liu
Hongyu Chen
Li Liang
Li Su
Qingming Huang
MLLM, VLM, LRM
156
6
0
25 Mar 2025
FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments
Sree Bhargavi Balija
FedML
92
0
0
25 Mar 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
151
1
0
23 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
95
1
0
21 Mar 2025
A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli
Pengyu Liu
Guohua Dong
D. Guo
Kun Li
Fengling Li
Xun Yang
Meng Wang
Xiaomin Ying
AI4CE
83
0
0
20 Mar 2025
FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification
Jiadong Wang
Weiwei Song
Hao Chen
Jie Ren
Huimin Zhao
146
0
0
18 Mar 2025
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Ju He
Qihang Yu
Qihao Liu
Liang-Chieh Chen
150
1
0
13 Mar 2025