VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,200 papers shown
Title
Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
Juseong Jin
Chang Wook Jeong
74
3
0
13 Oct 2024
Robust 3D Point Clouds Classification based on Declarative Defenders
Kaidong Li
Tianxiao Zhang
Cuncong Zhong
Zizhuo Zhang
G. Wang
3DPC
74
1
0
13 Oct 2024
A Social Context-aware Graph-based Multimodal Attentive Learning Framework for Disaster Content Classification during Emergencies
Shahid Shafi Dar
Mohammad Zia Ur Rehman
Karan Bais
Mohammed Abdul Haseeb
Nagendra Kumara
79
13
0
11 Oct 2024
Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Dianzhi Yu
Xinni Zhang
Yankai Chen
Aiwei Liu
Yifei Zhang
Philip S. Yu
Irwin King
VLM, CLL
101
13
0
07 Oct 2024
Fine-Grained Prediction of Reading Comprehension from Eye Movements
Omer Shubi
Yoav Meiri
Cfir Avraham Hadar
Yevgeni Berzak
78
5
0
06 Oct 2024
DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights
Yihao Ding
S. Han
Zechuan Li
Hyunsuk Chung
70
2
0
02 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
175
0
0
01 Oct 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM, AuLLM
173
12
0
26 Sep 2024
Global-Local Medical SAM Adaptor Based on Full Adaption
Meng Wang
Yarong Feng
Yongwei Tang
Tian Zhang
Yuxin Liang
Chao Lv
MedIm
50
1
0
26 Sep 2024
Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Raja Kumar
Raghav Singhal
Pranamya Kulkarni
Deval Mehta
Kshitij S. Jadhav
81
0
0
26 Sep 2024
GeoBiked: A Dataset with Geometric Features and Automated Labeling Techniques to Enable Deep Generative Models in Engineering Design
Phillip Mueller
Sebastian Mueller
Lars Mikelsons
112
2
0
25 Sep 2024
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
Soojin Jang
Jungmin Yun
Junehyoung Kwon
Eunju Lee
Youngbin Kim
104
3
0
24 Sep 2024
Embodiment-Agnostic Action Planning via Object-Part Scene Flow
Weiliang Tang
Jia-Hui Pan
Wei Zhan
Jianshu Zhou
Huaxiu Yao
Yun-Hui Liu
Masayoshi Tomizuka
Mingyu Ding
Chi-Wing Fu
122
1
0
16 Sep 2024
Generalization Boosted Adapter for Open-Vocabulary Segmentation
Wenhao Xu
Changwei Wang
Xuxiang Feng
Rongtao Xu
Longzhao Huang
Zherui Zhang
Li Guo
Shibiao Xu
VLM
87
3
0
13 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe, VLM
61
0
0
12 Sep 2024
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
145
2
0
07 Sep 2024
MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality
Ruiting Dai
Yuqiao Tan
Lisi Mo
Tao He
Ke Qin
Shuang Liang
VLM
77
3
0
07 Sep 2024
TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model
Yihao Zhao
Enhao Zhong
Cuiyun Yuan
Yang Li
Man Zhao
Chunxia Li
Jun Hu
Chenbin Liu
VLM, MedIm
91
0
0
05 Sep 2024
CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
Ivana Beňová
Michal Gregor
Albert Gatt
74
1
0
02 Sep 2024
COSMo: CLIP Talks on Open-Set Multi-Target Domain Adaptation
Munish Monga
Sachin Kumar Giroh
Ankit Jha
Mainak Singha
Biplab Banerjee
Jocelyn Chanussot
111
2
0
31 Aug 2024
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction
Yuka Ko
Sheng Li
Chao-Han Huck Yang
Tatsuya Kawahara
AuLLM
43
4
0
29 Aug 2024
LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
Ali Asgarov
Samir Rustamov
VLM
36
1
0
25 Aug 2024
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering
Yuliang Cai
Mohammad Rostami
CLL, VLM, MLLM
128
4
0
21 Aug 2024
C²RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLM, SLR
68
3
0
19 Aug 2024
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach
Muhammad Saad Saeed
Shah Nawaz
Muhammad Zaigham Zaheer
Muhammad Haris Khan
Karthik Nandakumar
Muhammad Haroon Yousaf
Hassan Sajjad
Tom De Schepper
Markus Schedl
93
0
0
14 Aug 2024
Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery
Long Bai
Guankun Wang
Mobarakol Islam
Lalithkumar Seenivasan
An-Chi Wang
Hongliang Ren
104
17
0
09 Aug 2024
Deep Learning based Visually Rich Document Content Understanding: A Survey
Muhammad Ali
Jean Lee
Salman Khan
Eduard Hovy
111
6
0
02 Aug 2024
Towards Flexible Evaluation for Generative Visual Question Answering
Huishan Ji
Q. Si
Zheng Lin
Weiping Wang
87
1
0
01 Aug 2024
Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances
Mieko Ochi
Ziwei Gong
D. Komura
Pengyuan Shi
Kaan Donbekci
Julia Hirschberg
107
16
0
31 Jul 2024
PIXELMOD: Improving Soft Moderation of Visual Misleading Information on Twitter
Pujan Paudel
Chen Ling
Jeremy Blackburn
Gianluca Stringhini
68
1
0
30 Jul 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
90
2
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
99
0
0
28 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
58
0
0
23 Jul 2024
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu
Yiyi Chen
Yan Li
Minquan Wang
Haoqian Wang
Quan Chen
Han Li
Peng Jiang
AI4TS
78
0
0
23 Jul 2024
Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities
Muhammad Irzam Liaqat
Shah Nawaz
Muhammad Zaigham Zaheer
M. S. Saeed
Hassan Sajjad
Tom De Schepper
Karthik Nandakumar
Muhammad Haris Khan
96
1
0
23 Jul 2024
MuTT: A Multimodal Trajectory Transformer for Robot Skills
Claudius Kienle
Benjamin Alt
Onur Celik
P. Becker
Darko Katic
Rainer Jäkel
Gerhard Neumann
70
2
0
22 Jul 2024
Benchmark Granularity and Model Robustness for Image-Text Retrieval
Mariya Hendriksen
Shuo Zhang
R. Reinanda
Mohamed Yahya
Edgar Meij
Maarten de Rijke
75
0
0
21 Jul 2024
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
Zaiqiao Meng
Hao Zhou
Yifang Chen
66
4
0
19 Jul 2024
Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral
Cordelia Schmid
Benoît Sagot
Rachel Bawden
106
4
0
18 Jul 2024
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Gengze Zhou
Yicong Hong
Zun Wang
Xin Eric Wang
Qi Wu
LM&Ro
96
30
0
17 Jul 2024
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception
Chunliang Li
Wencheng Han
Junbo Yin
Sanyuan Zhao
Jianbing Shen
86
4
0
15 Jul 2024
How and where does CLIP process negation?
Vincent Quantmeyer
Pablo Mosteiro
Albert Gatt
CoGe
73
9
0
15 Jul 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Zhongang Qi
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
62
2
0
10 Jul 2024
A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen
Xingyao Wang
Hao Peng
Heng Ji
LRM
107
17
0
08 Jul 2024
AI as a Tool for Fair Journalism: Case Studies from Malta
Dylan Seychell
Gabriel Hili
Jonathan Attard
Konstantinos Makantatis
31
3
0
08 Jul 2024
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning
Mainak Singha
Ankit Jha
Divyam Gupta
Pranav Singla
Biplab Banerjee
VLM
92
1
0
05 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
96
15
0
03 Jul 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLM, LRM
132
3
0
25 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
130
3
0
24 Jun 2024
Towards Natural Language-Driven Assembly Using Foundation Models
O. Joglekar
Tal Lancewicki
Shir Kozlovsky
Vladimir Tchuiev
Zohar Feldman
Dotan Di Castro
LM&Ro
75
0
0
23 Jun 2024