ResearchTrend.AI
VisualBERT: A Simple and Performant Baseline for Vision and Language


9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,200 papers shown
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
Haogeng Liu
Qihang Fan
Tingkai Liu
Linjie Yang
Yunzhe Tao
Huaibo Huang
Ran He
Hongxia Yang
VGen
55
12
0
08 Oct 2023
Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
Yihao Xue
Siddharth Joshi
Dang Nguyen
Baharan Mirzasoleiman
VLM
69
4
0
08 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
69
0
0
07 Oct 2023
SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering
Bruno Souza
Marius Aasan
Hélio Pedrini
Adín Ramirez Rivera
SSL
86
2
0
03 Oct 2023
PROSE: Predicting Operators and Symbolic Expressions using Multimodal Transformers
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
81
18
0
28 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
69
3
0
28 Sep 2023
Rapid Network Adaptation: Learning to Adapt Neural Networks Using Test-Time Feedback
Teresa Yeo
Oğuzhan Fatih Kar
Zahra Sodagar
Amir Zamir
TTA, OOD
74
4
0
27 Sep 2023
Survey of Social Bias in Vision-Language Models
Nayeon Lee
Yejin Bang
Holy Lovenia
Samuel Cahyawijaya
Wenliang Dai
Pascale Fung
VLM
126
19
0
24 Sep 2023
GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph
Xin Li
Dongze Lian
Zhihe Lu
Jiawang Bai
Zhibo Chen
Xinchao Wang
VLM
108
66
0
24 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sánchez Villegas
Daniel Preoţiuc-Pietro
Nikolaos Aletras
64
3
0
14 Sep 2023
PRE: Vision-Language Prompt Learning with Reparameterization Encoder
Anh Pham Thi Minh
An Duc Nguyen
Georgios Tzimiropoulos
VPVLM, VLM
85
3
0
14 Sep 2023
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Palaash Agrawal
Haidi Azaman
Cheston Tan
158
3
0
13 Sep 2023
Overview of Memotion 3: Sentiment and Emotion Analysis of Codemixed Hinglish Memes
Shreyash Mishra
S. Suryavardan
Megha Chakraborty
Parth Patwa
Anku Rani
...
Amitava Das
A. Sheth
Manoj Kumar Chinnakotla
Asif Ekbal
Srijan Kumar
51
5
0
12 Sep 2023
Multi-modal Extreme Classification
Anshul Mittal
Kunal Dahiya
Shreya Malani
Janani Ramaswamy
Seba Kuruvilla
Jitendra Ajmera
Keng-hao Chang
Sumeet Agarwal
Purushottam Kar
Manik Varma
89
8
0
10 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
99
27
0
08 Sep 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
70
2
0
06 Sep 2023
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Qiong Wu
Wei Yu
Yiyi Zhou
Shubin Huang
Xiaoshuai Sun
Rongrong Ji
VLM
86
7
0
04 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
95
19
0
04 Sep 2023
Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition
Hyeongju Choi
Apoorva Beedu
Irfan Essa
SSL
71
3
0
03 Sep 2023
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
Fengxiang Bie
Yibo Yang
Zhongzhu Zhou
Adam Ghanem
Minjia Zhang
...
Pareesa Ameneh Golnari
David A. Clifton
Yuxiong He
Dacheng Tao
Shuaiwen Leon Song
EGVM
105
25
0
02 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval
Prateksha Udhayanan
Srikrishna Karanam
Balaji Vasan Srinivasan
50
2
0
31 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
75
5
0
30 Aug 2023
CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation
Devaansh Gupta
Siddhant Kharbanda
Jiawei Zhou
Wanhua Li
Hanspeter Pfister
D. Wei
VLM
86
13
0
29 Aug 2023
Conformal Meta-learners for Predictive Inference of Individual Treatment Effects
Ahmed Alaa
Zaid Ahmad
Mark van der Laan
CML
197
16
0
28 Aug 2023
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
Fei Wang
Liang Ding
Jun Rao
Ye Liu
Li Shen
Changxing Ding
92
15
0
24 Aug 2023
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
Yibo Cui
Liang Xie
Yakun Zhang
Meishan Zhang
Ye Yan
Erwei Yin
LM&Ro
87
17
0
24 Aug 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM, VLM, MoE
73
10
0
23 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
85
14
0
22 Aug 2023
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Xi Deng
Han Shi
Runhu Huang
Changlin Li
Hang Xu
Jianhua Han
James T. Kwok
Shen Zhao
Wei Zhang
Xiaodan Liang
CLIP, VLM
91
3
0
22 Aug 2023
Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
Baoshuo Kan
Teng Wang
Wenpeng Lu
Xiantong Zhen
Weili Guan
Feng Zheng
VPVLM, VLM
90
26
0
22 Aug 2023
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Haokun Chen
Yao Zhang
Denis Krompass
Jindong Gu
Volker Tresp
FedML
114
54
0
21 Aug 2023
Generic Attention-model Explainability by Weighted Relevance Accumulation
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
46
1
0
20 Aug 2023
Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes
Yosuke Miyanishi
Minh Le Nguyen
97
2
0
19 Aug 2023
Vision Relation Transformer for Unbiased Scene Graph Generation
Gopika Sudhakaran
Devendra Singh Dhami
Kristian Kersting
Stefan Roth
ViT
117
18
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
74
1
0
18 Aug 2023
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
Minsu Kim
Jeong Hun Yeo
J. Choi
Y. Ro
73
17
0
18 Aug 2023
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
Runhu Huang
Jianhua Han
Guansong Lu
Xiaodan Liang
Yihan Zeng
Wei Zhang
Hang Xu
DiffM
62
2
0
18 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
119
19
0
16 Aug 2023
Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
Rui Cao
Ming Shan Hee
Adriel Kuek
Wen-Haw Chong
Roy Ka-wei Lee
Jing Jiang
VLM, MLLM
56
43
0
16 Aug 2023
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
Haiyang Wang
Hao Tang
Shaoshuai Shi
Aoxue Li
Zhenguo Li
Bernt Schiele
Liwei Wang
ViT
123
56
0
15 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
44
19
0
15 Aug 2023
Thinking Like an Expert:Multimodal Hypergraph-of-Thought (HoT) Reasoning to boost Foundation Modals
Fanglong Yao
Changyuan Tian
Jintao Liu
Zequn Zhang
Qing Liu
Li Jin
Shuchao Li
Xiaoyu Li
Xian Sun
ReLM, LRM
74
17
0
11 Aug 2023
AD-CLIP: Adapting Domains in Prompt Space Using CLIP
Mainak Singha
Harsh Pal
Ankit Jha
Biplab Banerjee
VLM
75
53
0
10 Aug 2023
Cross-Domain Product Representation Learning for Rich-Content E-Commerce
Xuehan Bai
Yan Li
Yong Cheng
Wenjie Yang
Quanming Chen
Han Li
61
4
0
10 Aug 2023
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Haowei Wang
Jiji Tang
Jiayi Ji
Xiaoshuai Sun
Rongsheng Zhang
...
Minda Zhao
Lincheng Li
Zeng Zhao
Tangjie Lv
Rongrong Ji
3DV
99
16
0
06 Aug 2023
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
Cheng-Yu Hsieh
Sibei Chen
Chun-Liang Li
Yasuhisa Fujii
Alexander Ratner
Chen-Yu Lee
Ranjay Krishna
Tomas Pfister
LLMAG, SyDa
148
44
0
01 Aug 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
58
2
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe, MLLM
126
46
0
30 Jul 2023