Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.03557
Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 1,200 papers shown
Title
Panoptic Narrative Grounding
Cristina González
Nicolás Ayobi
Isabela Hernández
José Hernández
Jordi Pont-Tuset
Pablo Arbeláez
146
23
0
10 Sep 2021
Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation
H. Khan
D. Gupta
Asif Ekbal
57
14
0
10 Sep 2021
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Stella Frank
Emanuele Bugliarello
Desmond Elliott
74
82
0
09 Sep 2021
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Xiao Dong
Xunlin Zhan
Yangxin Wu
Yunchao Wei
Michael C. Kampffmeyer
Xiaoyong Wei
Minlong Lu
Yaowei Wang
Xiaodan Liang
116
38
0
09 Sep 2021
Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
Man Luo
Yankai Zeng
Pratyay Banerjee
Chitta Baral
RALM
131
66
0
09 Sep 2021
Towards Natural Language Interfaces for Data Visualization: A Survey
Leixian Shen
Enya Shen
Yuyu Luo
Xiaocong Yang
Xuming Hu
Xiongshuai Zhang
Zhiwei Tai
Jianmin Wang
111
146
0
08 Sep 2021
CTRL-C: Camera calibration TRansformer with Line-Classification
Jinwoo Lee
Hyun-Young Go
Hyunjoon Lee
Sunghyun Cho
Minhyuk Sung
Junho Kim
ViT
87
36
0
06 Sep 2021
Data Efficient Masked Language Modeling for Vision and Language
Yonatan Bitton
Gabriel Stanovsky
Michael Elhadad
Roy Schwartz
VLM
82
20
0
05 Sep 2021
Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment
Zhanghexuan Ji
Mohammad Abuzar Shaikh
Dana Moukheiber
S. Srihari
Yifan Peng
Mingchen Gao
SSL
92
21
0
04 Sep 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
44
1
0
04 Sep 2021
Multimodal Conditionality for Natural Language Generation
Michael Sollami
Aashish Jain
73
10
0
02 Sep 2021
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Hang Li
Yunxing Kang
Tianqiao Liu
Wenbiao Ding
Zitao Liu
71
19
0
01 Sep 2021
Fine-Grained Chemical Entity Typing with Multimodal Knowledge Representation
Chenkai Sun
Weijian Li
Jinfeng Xiao
Nikolaus Nova Parulian
ChengXiang Zhai
Heng Ji
79
4
0
29 Aug 2021
Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
Yuqing Song
Shizhe Chen
Qin Jin
Wei Luo
Jun Xie
Fei Huang
101
20
0
25 Aug 2021
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
VLM
MLLM
153
799
0
24 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
92
33
0
20 Aug 2021
Who's Waldo? Linking People Across Text and Images
Claire Yuqing Cui
Apoorv Khandelwal
Yoav Artzi
Noah Snavely
Hadar Averbuch-Elor
85
21
0
16 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
59
56
0
16 Aug 2021
Metadata-based Multi-Task Bandits with Bayesian Hierarchical Models
Runzhe Wan
Linjuan Ge
Rui Song
75
29
0
13 Aug 2021
BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis
Masoud Monajatipoor
Mozhdeh Rouhsedaghat
Liunian Harold Li
Aichi Chien
C.-C. Jay Kuo
Fabien Scalzo
Kai-Wei Chang
LM&MA
MedIm
60
31
0
10 Aug 2021
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Alessandro Suglia
Qiaozi Gao
Jesse Thomason
Govind Thattai
Gaurav Sukhatme
LM&Ro
127
78
0
10 Aug 2021
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
Zheyuan Liu
Cristian Rodriguez-Opazo
Damien Teney
Stephen Gould
VLM
79
206
0
09 Aug 2021
Disentangling Hate in Online Memes
Rui Cao
Ziqing Fan
Roy Ka-wei Lee
Wen-Haw Chong
Jing Jiang
65
81
0
09 Aug 2021
Detecting Propaganda Techniques in Memes
Dimitar Dimitrov
Bishr Bin Ali
Shaden Shaar
Firoj Alam
Fabrizio Silvestri
Hamed Firooz
Preslav Nakov
Giovanni Da San Martino
87
95
0
07 Aug 2021
Vision Transformer with Progressive Sampling
Xiaoyu Yue
Shuyang Sun
Zhanghui Kuang
Meng Wei
Philip Torr
Wayne Zhang
Dahua Lin
ViT
89
85
0
03 Aug 2021
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
Rinon Gal
Or Patashnik
Haggai Maron
Gal Chechik
Daniel Cohen-Or
CLIP
VLM
90
232
0
02 Aug 2021
Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding
Heng Zhao
Qiufeng Wang
Yew-Soon Ong
ObjD
74
26
0
31 Jul 2021
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining
Xunlin Zhan
Yangxin Wu
Xiao Dong
Yunchao Wei
Minlong Lu
Yichi Zhang
Hang Xu
Xiaodan Liang
ViT
92
67
0
30 Jul 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
107
143
0
29 Jul 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
76
2
0
27 Jul 2021
Language Grounding with 3D Objects
Jesse Thomason
Mohit Shridhar
Yonatan Bisk
Chris Paxton
Luke Zettlemoyer
LM&Ro
88
53
0
26 Jul 2021
Multi-stage Pre-training over Simplified Multimodal Pre-training Models
Tongtong Liu
Fangxiang Feng
Xiaojie Wang
38
13
0
22 Jul 2021
DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework
Haiwen Hong
Xuan Jin
Yin Zhang
Yunqing Hu
Jingfeng Zhang
Yuan He
Hui Xue
MoE
34
0
0
21 Jul 2021
Separating Skills and Concepts for Novel Visual Question Answering
Spencer Whitehead
Hui Wu
Heng Ji
Rogerio Feris
Kate Saenko
CoGe
95
34
0
19 Jul 2021
Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images
Nyoungwoo Lee
Suwon Shin
Jaegul Choo
Ho‐Jin Choi
S. Myaeng
60
27
0
19 Jul 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
324
1,986
0
16 Jul 2021
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Zetian Wu
Yun Cheng
...
Peter Wu
Michelle A. Lee
Yuke Zhu
Ruslan Salakhutdinov
Louis-Philippe Morency
VLM
111
172
0
15 Jul 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Joey Tianyi Zhou
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP
VLM
MLLM
270
412
0
13 Jul 2021
Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers
Ruihan Yang
Minghao Zhang
Nicklas Hansen
Huazhe Xu
Xiaolong Wang
OffRL
85
108
0
08 Jul 2021
IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task
Baban Gain
Dibyanayan Bandyopadhyay
Asif Ekbal
51
9
0
04 Jul 2021
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
167
101
0
01 Jul 2021
GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph
Yunxin Li
Yu Zhao
Baotian Hu
Qingcai Chen
Yang Xiang
Xiaolong Wang
Yuxin Ding
Lin Ma
44
7
0
01 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
82
38
0
01 Jul 2021
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
114
574
0
30 Jun 2021
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo
94
89
0
25 Jun 2021
A Picture May Be Worth a Hundred Words for Visual Question Answering
Yusuke Hirota
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Ittetsu Taniguchi
Takao Onoye
ViT
35
4
0
25 Jun 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Keda Lu
Bo Fang
Kuan-Yu Chen
ViT
38
2
0
24 Jun 2021
DocFormer: End-to-End Transformer for Document Understanding
Srikar Appalaraju
Bhavan A. Jasani
Bhargava Urala Kota
Yusheng Xie
R. Manmatha
ViT
117
281
0
22 Jun 2021
AOMD: An Analogy-aware Approach to Offensive Meme Detection on Social Media
Lanyu Shang
Yang Zhang
Yuheng Zha
Yingxi Chen
Christina Youn
Dong Wang
51
27
0
21 Jun 2021
Efficient Self-supervised Vision Transformers for Representation Learning
Chunyuan Li
Jianwei Yang
Pengchuan Zhang
Mei Gao
Bin Xiao
Xiyang Dai
Lu Yuan
Jianfeng Gao
ViT
108
214
0
17 Jun 2021
Previous
1
2
3
...
19
20
21
22
23
24
Next