arXiv: 1908.03557
VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,200 papers shown
Title
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
125
25
0
06 May 2023
Fairness in Image Search: A Study of Occupational Stereotyping in Image Retrieval and its Debiasing
Swagatika Dash
38
0
0
06 May 2023
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering
Lei Wang
Yilang Hu
Jiabang He
Xingdong Xu
Ning Liu
Hui-juan Liu
Hengtao Shen
LRM MLLM
116
48
0
05 May 2023
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Yunxin Li
Baotian Hu
Yuxin Ding
Lin Ma
Hao Fei
74
5
0
03 May 2023
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Qiuyuan Huang
Jinho Park
Abhinav Gupta
Paul N. Bennett
Ran Gong
...
Baolin Peng
O. Mohammed
C. Pal
Yejin Choi
Jianfeng Gao
119
6
0
01 May 2023
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao
Jiaming Han
Renrui Zhang
Ziyi Lin
Shijie Geng
...
Pan Lu
Conghui He
Xiangyu Yue
Hongsheng Li
Yu Qiao
MLLM
118
588
0
28 Apr 2023
An Empirical Study of Multimodal Model Merging
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Joey Tianyi Zhou
Lijuan Wang
MoMe
118
42
0
28 Apr 2023
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation
Chengyue Wu
Teng Wang
Yixiao Ge
Zeyu Lu
Rui-Zhi Zhou
Ying Shan
Ping Luo
MoMe
145
37
0
27 Apr 2023
Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables
Matthias Urban
Carsten Binnig
71
5
0
26 Apr 2023
Sample-Specific Debiasing for Better Image-Text Models
Peiqi Wang
Yingcheng Liu
Ching-Yun Ko
W. Wells
Seth Berkowitz
Steven Horng
Polina Golland
SSL MedIm
111
1
0
25 Apr 2023
Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining
Giacomo Nebbia
Adriana Kovashka
103
0
0
25 Apr 2023
SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery
Lalithkumar Seenivasan
Mobarakol Islam
Gokul Kannan
Hongliang Ren
86
43
0
19 Apr 2023
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Pan Lu
Baolin Peng
Hao Cheng
Michel Galley
Kai-Wei Chang
Ying Nian Wu
Song-Chun Zhu
Jianfeng Gao
KELM MLLM LRM
155
325
0
19 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
63
13
0
18 Apr 2023
Learning Situation Hyper-Graphs for Video Question Answering
Aisha Urooj Khan
Hilde Kuehne
Bo Wu
Kim Chheu
Walid Bousselham
Chuang Gan
N. Lobo
M. Shah
90
16
0
18 Apr 2023
Towards Robust Prompts on Vision-Language Models
Jindong Gu
Ahmad Beirami
Xuezhi Wang
Alex Beutel
Philip Torr
Yao Qin
VLM VPVLM
86
8
0
17 Apr 2023
Interpretable Detection of Out-of-Context Misinformation with Neural-Symbolic-Enhanced Large Multimodal Model
Yizhou Zhang
Loc Trinh
Defu Cao
Zijun Cui
Yang Liu
68
9
0
15 Apr 2023
CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval
Yang Yang
Zhongtian Fu
Xiangyu Wu
Wenjie Li
VLM
63
1
0
15 Apr 2023
TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation
Jingyao Li
Pengguang Chen
Shengju Qian
Jiaya Jia
VLM
80
13
0
15 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
77
11
0
14 Apr 2023
PDFVQA: A New Dataset for Real-World VQA on PDF Documents
Yihao Ding
Siwen Luo
Hyunsuk Chung
S. Han
103
18
0
13 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP VLM
55
1
0
10 Apr 2023
Enhancing Multimodal Entity and Relation Extraction with Variational Information Bottleneck
Shiyao Cui
Jiangxia Cao
Xin Cong
Shuaiyi Nie
Quangang Li
Tingwen Liu
Jinqiao Shi
67
25
0
05 Apr 2023
G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System
Lixia Wu
Jianlin Liu
Junhong Lou
Haoyuan Hu
Jianbin Zheng
Haomin Wen
Chao Song
Shu He
VLM
66
5
0
04 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
125
50
0
31 Mar 2023
SemiMemes: A Semi-supervised Learning Approach for Multimodal Memes Analysis
Pham Thai Hoang Tung
Nguyen Tan Viet
Ngo Tien Anh
P. D. Hung
34
6
0
31 Mar 2023
Dual Cross-Attention for Medical Image Segmentation
Gorkem Can Ates
P. Mohan
Emrah Çelik
56
85
0
30 Mar 2023
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger
Yuting Gao
Jinfeng Liu
Zi-Han Xu
Tong Wu
Wen Liu
Jie Yang
Keren Li
Xingen Sun
CLIP VLM
64
47
0
30 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
130
9
0
30 Mar 2023
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Sifan Long
Zhen Zhao
Junkun Yuan
Zichang Tan
Jiangjiang Liu
Luping Zhou
Sheng-sheng Wang
Jingdong Wang
VLM
113
3
0
30 Mar 2023
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations
VS Vibashan
Ning Yu
Chen Xing
Can Qin
M. Gao
Juan Carlos Niebles
Vishal M. Patel
Ran Xu
VLM ISeg
78
18
0
29 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
62
0
0
28 Mar 2023
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang
Jiaming Han
Chris Liu
Peng Gao
Aojun Zhou
Xiangfei Hu
Shilin Yan
Pan Lu
Hongsheng Li
Yu Qiao
MLLM
179
787
0
28 Mar 2023
Egocentric Auditory Attention Localization in Conversations
Fiona Ryan
Hao Jiang
Abhinav Shukla
James M. Rehg
V. Ithapu
EgoV
70
16
0
28 Mar 2023
Curriculum Learning for Compositional Visual Reasoning
Wafa Aissa
Marin Ferecatu
M. Crucianu
LRM
82
3
0
27 Mar 2023
Equivariant Similarity for Vision-Language Foundation Models
Tan Wang
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Zhengyuan Yang
Hanwang Zhang
Zicheng Liu
Lijuan Wang
CoGe
83
51
0
25 Mar 2023
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
Junjie Ke
Keren Ye
Jiahui Yu
Yonghui Wu
P. Milanfar
Feng Yang
VLM
102
61
0
24 Mar 2023
Accelerating Vision-Language Pretraining with Free Language Modeling
Teng Wang
Yixiao Ge
Feng Zheng
Ran Cheng
Ying Shan
Xiaohu Qie
Ping Luo
VLM MLLM
113
10
0
24 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
81
13
0
23 Mar 2023
VideoXum: Cross-modal Visual and Textural Summarization of Videos
Jingyang Lin
Hang Hua
Ming Chen
Yikang Li
Jenhao Hsiao
C. Ho
Jiebo Luo
106
33
0
21 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
167
48
0
21 Mar 2023
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
Deepti Hegde
Jeya Maria Jose Valanarasu
Vishal M. Patel
CLIP
118
68
0
20 Mar 2023
Label Name is Mantra: Unifying Point Cloud Segmentation across Heterogeneous Datasets
Yixun Liang
Hao He
Shishi Xiao
Hao Lu
Yingke Chen
3DPC
47
3
0
19 Mar 2023
DeAR: Debiasing Vision-Language Models with Additive Residuals
Ashish Seth
Mayur Hemani
Chirag Agarwal
VLM
62
56
0
18 Mar 2023
PersonalTailor: Personalizing 2D Pattern Design from 3D Garment Point Clouds
Sauradip Nag
Anran Qi
Xiatian Zhu
Ariel Shamir
3DPC
65
7
0
17 Mar 2023
MultiModal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision Language Models
Sepehr Janghorbani
Gerard de Melo
VLM
103
12
0
16 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM MoE
77
68
0
13 Mar 2023
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Qian Jiang
Changyou Chen
Han Zhao
Liqun Chen
Q. Ping
S. D. Tran
Yi Xu
Belinda Zeng
Trishul Chilimbi
97
43
0
10 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
57
1
0
09 Mar 2023
Toward Unsupervised Realistic Visual Question Answering
Yuwei Zhang
Chih-Hui Ho
Nuno Vasconcelos
CoGe
85
2
0
09 Mar 2023