1908.03557
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Papers citing
"VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 1,180 papers shown
Title
InfographicVQA
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
39
206
0
26 Apr 2021
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images
Dimitar Dimitrov
Bishr Bin Ali
Shaden Shaar
Firoj Alam
Fabrizio Silvestri
Hamed Firooz
Preslav Nakov
Giovanni Da San Martino
21
103
0
25 Apr 2021
MusCaps: Generating Captions for Music Audio
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
30
36
0
24 Apr 2021
Playing Lottery Tickets with Vision and Language
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
109
54
0
23 Apr 2021
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
63
1,224
0
22 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
35
23
0
20 Apr 2021
BM-NAS: Bilevel Multimodal Neural Architecture Search
Yihang Yin
Siyu Huang
Xiang Zhang
32
27
0
19 Apr 2021
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Yiheng Xu
Tengchao Lv
Lei Cui
Guoxin Wang
Yijuan Lu
D. Florêncio
Cha Zhang
Furu Wei
MLLM
VLM
38
127
0
18 Apr 2021
LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding
Te-Lin Wu
Cheng-rong Li
Mingyang Zhang
Tao Chen
Spurthi Amba Hombaiah
Michael Bendersky
21
14
0
16 Apr 2021
Cross-Modal Retrieval Augmentation for Multi-Modal Classification
Shir Gur
Natalia Neverova
C. Stauffer
Ser-Nam Lim
Douwe Kiela
A. Reiter
19
26
0
16 Apr 2021
Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models
Taichi Iki
Akiko Aizawa
VLM
27
20
0
16 Apr 2021
NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
Grace Luo
Trevor Darrell
Anna Rohrbach
8
84
0
13 Apr 2021
Non-autoregressive Transformer-based End-to-end ASR using BERT
Fu-Hao Yu
Kuan-Yu Chen
27
22
0
10 Apr 2021
How Transferable are Reasoning Patterns in VQA?
Corentin Kervadec
Theo Jaunet
G. Antipov
M. Baccouche
Romain Vuillemot
Christian Wolf
LRM
23
28
0
08 Apr 2021
Multimodal Fusion Refiner Networks
Sethuraman Sankaran
David Yang
Ser-Nam Lim
OffRL
29
35
0
08 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
51
271
0
07 Apr 2021
Towards General Purpose Vision Systems
Tanmay Gupta
Amita Kamath
Aniruddha Kembhavi
Derek Hoiem
11
50
0
01 Apr 2021
Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation
Aviral Joshi
Chengzhi Huang
H. Singh
19
2
0
31 Mar 2021
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik
Zongze Wu
Eli Shechtman
Daniel Cohen-Or
Dani Lischinski
CLIP
VLM
37
1,192
0
31 Mar 2021
Diagnosing Vision-and-Language Navigation: What Really Matters
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Qing Guo
Qi Wu
Miguel P. Eckstein
Wei Wang
LM&Ro
27
50
0
30 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
25
136
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
30
120
0
30 Mar 2021
Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays
Xiaosong Wang
Ziyue Xu
Leo K. Tam
Dong Yang
Daguang Xu
ViT
MedIm
22
23
0
30 Mar 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
Hila Chefer
Shir Gur
Lior Wolf
ViT
31
303
0
29 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
29
329
0
29 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
30
145
0
28 Mar 2021
Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models
Arijit Ray
Michael Cogswell
Xiaoyu Lin
Kamran Alipour
Ajay Divakaran
Yi Yao
Giedrius Burachas
FAtt
11
4
0
26 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
28
140
0
23 Mar 2021
Instance-level Image Retrieval using Reranking Transformers
Fuwen Tan
Jiangbo Yuan
Vicente Ordonez
ViT
28
89
0
22 Mar 2021
MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
Zachary Seymour
Kowshik Thopalli
Niluthpol Chowdhury Mithun
Han-Pang Chiu
S. Samarasekera
Rakesh Kumar
3DPC
24
18
0
21 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
29
33
0
18 Mar 2021
Few-Shot Visual Grounding for Natural Human-Robot Interaction
Georgios Tziafas
S. Kasaei
27
6
0
17 Mar 2021
Multimodal End-to-End Sparse Model for Emotion Recognition
Wenliang Dai
Samuel Cahyawijaya
Zihan Liu
Pascale Fung
CVBM
13
79
0
17 Mar 2021
Predicting Opioid Use Disorder from Longitudinal Healthcare Data using Multi-stream Transformer
S. Fouladvand
J. Talbert
L. Dwoskin
H. Bush
A. Meadows
Lars E. Peterson
Ramakanth Kavuluru
Jin Chen
24
4
0
16 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
38
82
0
16 Mar 2021
A Survey on Multimodal Disinformation Detection
Firoj Alam
S. Cresci
Tanmoy Chakraborty
Fabrizio Silvestri
Dimiter Dimitrov
Giovanni Da San Martino
Shaden Shaar
Hamed Firooz
Preslav Nakov
18
98
0
13 Mar 2021
Unified Pre-training for Program Understanding and Generation
Wasi Uddin Ahmad
Saikat Chakraborty
Baishakhi Ray
Kai-Wei Chang
41
749
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
Andrew Shin
Masato Ishii
T. Narihira
35
37
0
06 Mar 2021
Causal Attention for Vision-Language Tasks
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
28
148
0
05 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
210
310
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Wei Lin
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
37
132
0
01 Mar 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go
Arnav Arora
Preslav Nakov
Momchil Hardalov
Sheikh Muhammad Sarwar
Vibha Nayak
...
Dimitrina Zlatkova
Kyle Dent
Ameya Bhatawdekar
Guillaume Bouchard
Isabelle Augenstein
33
46
0
27 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
25
295
0
22 Feb 2021
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafal Powalski
Łukasz Borchmann
Dawid Jurkiewicz
Tomasz Dwojak
Michal Pietruszka
Gabriela Pałka
ViT
36
157
0
18 Feb 2021
Hierarchical Similarity Learning for Language-based Product Image Retrieval
Zhe Ma
Fenghao Liu
Jianfeng Dong
Xiaoye Qu
Yuan He
S. Ji
VLM
16
4
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
299
1,084
0
17 Feb 2021
LambdaNetworks: Modeling Long-Range Interactions Without Attention
Irwan Bello
281
179
0
17 Feb 2021
Biomedical Question Answering: A Survey of Approaches and Challenges
Qiao Jin
Zheng Yuan
Guangzhi Xiong
Qian Yu
Huaiyuan Ying
Chuanqi Tan
Mosha Chen
Songfang Huang
Xiaozhong Liu
Sheng Yu
26
95
0
10 Feb 2021
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
Linwei Ye
Mrigank Rochan
Zhi Liu
Xiaoqin Zhang
Yang Wang
VOS
EgoV
30
55
0
09 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
16
10
0
07 Feb 2021