Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.08530
Cited By
v1
v2
v3
v4 (latest)
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Re-assign community
ArXiv (abs)
PDF
HTML
Github (740★)
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,020 papers shown
Title
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
98
35
0
18 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
89
84
0
16 Mar 2021
SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
Chenliang Li
Ming Yan
Haiyang Xu
Fuli Luo
Wei Wang
Bin Bi
Songfang Huang
VLM
74
36
0
14 Mar 2021
What is Multimodality?
Letitia Parcalabescu
Nils Trost
Anette Frank
56
0
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
Andrew Shin
Masato Ishii
T. Narihira
140
39
0
06 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
286
322
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Wei Lin
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
152
134
0
01 Mar 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go
Arnav Arora
Preslav Nakov
Momchil Hardalov
Sheikh Muhammad Sarwar
Vibha Nayak
...
Dimitrina Zlatkova
Kyle Dent
Ameya Bhatawdekar
Guillaume Bouchard
Isabelle Augenstein
90
53
0
27 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
106
301
0
22 Feb 2021
Learning Compositional Representation for Few-shot Visual Question Answering
Dalu Guo
Dacheng Tao
OOD
CoGe
64
4
0
21 Feb 2021
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
VLM
164
227
0
20 Feb 2021
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafal Powalski
Łukasz Borchmann
Dawid Jurkiewicz
Tomasz Dwojak
Michal Pietruszka
Gabriela Pałka
ViT
94
160
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
523
1,143
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
179
665
0
11 Feb 2021
The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point
Magnus Sahlgren
F. Carlsson
55
28
0
08 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
73
10
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
187
1,773
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
82
141
0
05 Feb 2021
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
Jwala Dhamala
Tony Sun
Varun Kumar
Satyapriya Krishna
Yada Pruksachatkun
Kai-Wei Chang
Rahul Gupta
94
403
0
27 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
95
53
0
27 Jan 2021
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Ozan Caglayan
Menekse Kuyu
Mustafa Sercan Amac
Pranava Madhyastha
Erkut Erdem
Aykut Erdem
Lucia Specia
VLM
71
46
0
25 Jan 2021
Understanding in Artificial Intelligence
S. Maetschke
D. M. Iraola
Pieter Barnard
Elaheh Shafieibavani
Peter Zhong
Ying Xu
Antonio Jimeno Yepes
ELM
VLM
46
0
0
17 Jan 2021
Latent Variable Models for Visual Question Answering
Zixu Wang
Yishu Miao
Lucia Specia
137
5
0
16 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
83
28
0
15 Jan 2021
Latent Alignment of Procedural Concepts in Multimodal Recipes
Hossein Rajaby Faghihi
Roshanak Mirzaee
Sudarshan Paliwal
Parisa Kordjamshidi
35
3
0
12 Jan 2021
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
387
2,567
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD
VLM
347
158
0
02 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
142
382
0
31 Dec 2020
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
55
0
0
30 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
93
30
0
30 Dec 2020
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
...
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
ViT
MLLM
240
522
0
29 Dec 2020
An Embarrassingly Simple Model for Dialogue Relation Extraction
Fuzhao Xue
Aixin Sun
Hao Zhang
Jinjie Ni
Eng Siong Chng
69
28
0
27 Dec 2020
Detecting Hateful Memes Using a Multimodal Deep Ensemble
Vlad Sandulescu
VLM
74
44
0
24 Dec 2020
A Survey on Visual Transformer
Kai Han
Yunhe Wang
Hanting Chen
Xinghao Chen
Jianyuan Guo
...
Chunjing Xu
Yixing Xu
Zhaohui Yang
Yiman Zhang
Dacheng Tao
ViT
233
2,278
0
23 Dec 2020
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu
Albert Gatt
Anette Frank
Iacer Calixto
LRM
95
49
0
22 Dec 2020
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
Zecheng He
Srinivas Sunkara
Xiaoxue Zang
Ying Xu
Lijuan Liu
Nevan Wichers
Gabriel Schubiner
Ruby B. Lee
Jindong Chen
Blaise Agüera y Arcas
107
80
0
22 Dec 2020
Object-Centric Diagnosis of Visual Reasoning
Jianwei Yang
Jiayuan Mao
Jiajun Wu
Devi Parikh
David D. Cox
J. Tenenbaum
Chuang Gan
OCL
82
16
0
21 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
128
188
0
20 Dec 2020
Transformer Interpretability Beyond Attention Visualization
Hila Chefer
Shir Gur
Lior Wolf
145
676
0
17 Dec 2020
MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification
Te-Lin Wu
Shikhar Singh
S. Paul
Gully A. Burns
Nanyun Peng
39
18
0
16 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
96
44
0
15 Dec 2020
Attention over learned object embeddings enables complex visual reasoning
David Ding
Felix Hill
Adam Santoro
Malcolm Reynolds
M. Botvinick
OCL
114
71
0
15 Dec 2020
Enhance Multimodal Transformer With External Label And In-Domain Pretrain: Hateful Meme Challenge Winning Solution
Ron Zhu
73
81
0
15 Dec 2020
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding
Qingxing Cao
Bailin Li
Xiaodan Liang
Keze Wang
Liang Lin
89
36
0
14 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL
LRM
89
39
0
13 Dec 2020
MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
Lefei Zhang
Jianfeng Gao
Zicheng Liu
VLM
MLLM
133
60
0
13 Dec 2020
Topological Planning with Transformers for Vision-and-Language Navigation
Kevin Chen
Junshen K. Chen
Jo Chuang
Nathan Tsoi
Silvio Savarese
LM&Ro
95
101
0
09 Dec 2020
Hateful Memes Detection via Complementary Visual and Linguistic Networks
W. Zhang
Guihua Liu
Zhuohua Li
Fuqing Zhu
62
19
0
09 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
107
144
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
104
78
0
08 Dec 2020
Previous
1
2
3
...
17
18
19
20
21
Next