Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.08530
Cited By
v1
v2
v3
v4 (latest)
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Re-assign community
ArXiv (abs)
PDF
HTML
Github (740★)
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,020 papers shown
Title
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
79
37
0
03 Mar 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
96
33
0
02 Mar 2022
Recent, rapid advancement in visual question answering architecture: a review
V. Kodali
Daniel Berleant
92
9
0
02 Mar 2022
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
Zihao Wang
Wei Liu
Qian He
Xin-ru Wu
Zili Yi
CLIP
VLM
260
75
0
01 Mar 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Mingyang Zhou
Licheng Yu
Amanpreet Singh
Mengjiao MJ Wang
Zhou Yu
Ning Zhang
VLM
82
31
0
01 Mar 2022
Multi-modal Alignment using Representation Codebook
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
101
68
0
28 Feb 2022
Joint Answering and Explanation for Visual Commonsense Reasoning
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Yin-wei Wei
Liqiang Nie
Mohan S. Kankanhalli
74
17
0
25 Feb 2022
Measuring CLEVRness: Blackbox testing of Visual Reasoning Models
Spyridon Mouselinos
Henryk Michalewski
Mateusz Malinowski
69
3
0
24 Feb 2022
A Survey of Vision-Language Pre-Trained Models
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
159
189
0
18 Feb 2022
AMS_ADRN at SemEval-2022 Task 5: A Suitable Image-text Multimodal Joint Modeling Method for Multi-task Misogyny Identification
Da Li
Ming Yi
Yukai He
24
1
0
18 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
183
227
0
18 Feb 2022
ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer
Kohei Uehara
Yusuke Mori
Yusuke Mukuta
Tatsuya Harada
93
6
0
15 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
93
39
0
15 Feb 2022
UserBERT: Modeling Long- and Short-Term User Preferences via Self-Supervision
Tianyu Li
Ali Cevahir
Derek Cho
Hao Gong
Duy Nguyen
B. Stenger
SSL
28
1
0
14 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Penglei Sun
Xuwu Wang
Yanghua Xiao
N. Yuan
73
167
0
11 Feb 2022
Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?
Jiawen Zhang
Abhijit Mishra
Avinesh P.V.S
Siddharth Patwardhan
Sachin Agarwal
75
0
0
09 Feb 2022
Robotic Grasping from Classical to Modern: A Survey
Hanbo Zhang
Jian Tang
Shiguang Sun
Xuguang Lan
93
41
0
08 Feb 2022
Universal Spam Detection using Transfer Learning of BERT Model
Vijay Srinivas Tida
Sonya Hsu
91
50
0
07 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
222
884
0
07 Feb 2022
A Frustratingly Simple Approach for End-to-End Image Captioning
Ziyang Luo
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM
MLLM
79
16
0
30 Jan 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
78
18
0
29 Jan 2022
Can Wikipedia Help Offline Reinforcement Learning?
Machel Reid
Yutaro Yamada
S. Gu
3DV
RALM
OffRL
240
96
0
28 Jan 2022
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Emanuele Bugliarello
Fangyu Liu
Jonas Pfeiffer
Siva Reddy
Desmond Elliott
Edoardo Ponti
Ivan Vulić
MLLM
VLM
ELM
119
64
0
27 Jan 2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Peixi Xiong
Yilin Shen
Hongxia Jin
35
5
0
25 Jan 2022
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering
Peixi Xiong
Quanzeng You
Pei Yu
Zicheng Liu
Ying Wu
60
5
0
25 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
286
237
0
20 Jan 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
101
41
0
20 Jan 2022
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
Zhecan Wang
Noel Codella
Yen-Chun Chen
Luowei Zhou
Jianwei Yang
Xiyang Dai
Bin Xiao
Haoxuan You
Shih-Fu Chang
Lu Yuan
CLIP
VLM
83
40
0
15 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
81
19
0
11 Jan 2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Ankur Sikarwar
Gabriel Kreiman
ViT
43
1
0
11 Jan 2022
A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval
Zhixiong Zeng
Wenji Mao
VLM
64
18
0
08 Jan 2022
Automatic Related Work Generation: A Meta Study
Xiangci Li
Jessica Ouyang
110
10
0
06 Jan 2022
Discrete and continuous representations and processing in deep learning: Looking forward
Ruben Cartuyvels
Graham Spinks
Marie-Francine Moens
OCL
91
20
0
04 Jan 2022
Contrastive Learning of Semantic and Visual Representations for Text Tracking
Zhuang Li
Weijia Wu
Mike Zheng Shou
Jiahong Li
Size Li
Zhongyuan Wang
Hong Zhou
50
10
0
30 Dec 2021
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
Mengde Xu
Zheng Zhang
Fangyun Wei
Yutong Lin
Yue Cao
Han Hu
Xiang Bai
VLM
143
226
0
29 Dec 2021
Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?
Sedigheh Eslami
Gerard de Melo
Christoph Meinel
CLIP
MedIm
84
121
0
27 Dec 2021
Multi-Image Visual Question Answering
Harsh Raj
Janhavi Dadhania
Akhilesh Bhardwaj
Prabuchandran KJ
40
2
0
27 Dec 2021
LaTr: Layout-Aware Transformer for Scene-Text VQA
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
125
102
0
23 Dec 2021
Understanding and Measuring Robustness of Multimodal Learning
Nishant Vishwamitra
Hongxin Hu
Ziming Zhao
Long Cheng
Feng Luo
AAML
86
5
0
22 Dec 2021
Contrastive Vision-Language Pre-training with Limited Resources
Quan Cui
Boyan Zhou
Yu Guo
Weidong Yin
Hao Wu
Osamu Yoshie
Yubo Chen
VLM
CLIP
53
34
0
17 Dec 2021
Masked Feature Prediction for Self-Supervised Visual Pre-Training
Chen Wei
Haoqi Fan
Saining Xie
Chaoxia Wu
Alan Yuille
Christoph Feichtenhofer
ViT
196
677
0
16 Dec 2021
Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang
Wenhui Wang
Haichao Zhu
Ming Liu
Bing Qin
Furu Wei
VLM
FedML
85
33
0
16 Dec 2021
KAT: A Knowledge Augmented Transformer for Vision-and-Language
Liangke Gui
Borui Wang
Qiuyuan Huang
Alexander G. Hauptmann
Yonatan Bisk
Jianfeng Gao
75
162
0
16 Dec 2021
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Zhecan Wang
Haoxuan You
Liunian Harold Li
Alireza Zareian
Suji Park
Yiqing Liang
Kai-Wei Chang
Shih-Fu Chang
ReLM
LRM
69
33
0
16 Dec 2021
3D Question Answering
Shuquan Ye
Dongdong Chen
Songfang Han
Jing Liao
ViT
94
49
0
15 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
104
118
0
14 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
74
42
0
14 Dec 2021
ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval
Boxuan Zhang
Chao Wei
Yang Jin
Weiru Zhang
55
2
0
14 Dec 2021
Co-training Transformer with Videos and Images Improves Action Recognition
Bowen Zhang
Jiahui Yu
Christopher Fifty
Wei Han
Andrew M. Dai
Ruoming Pang
Fei Sha
ViT
83
54
0
14 Dec 2021
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
Qing Li
Boqing Gong
Huayu Chen
Dan Kondratyuk
Xianzhi Du
Ming-Hsuan Yang
Matthew A. Brown
ViT
49
17
0
14 Dec 2021
Previous
1
2
3
...
12
13
14
...
19
20
21
Next