Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.06066
Cited By
v1
v2
v3 (latest)
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 512 papers shown
Title
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Aisha Urooj Khan
Amir Mazaheri
N. Lobo
M. Shah
97
57
0
27 Oct 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li
Haoxuan You
Zhecan Wang
Alireza Zareian
Shih-Fu Chang
Kai-Wei Chang
SSL
VLM
101
12
0
24 Oct 2020
ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding
Minjeong Kim
Gyuwan Kim
Sang-Woo Lee
Jung-Woo Ha
VLM
71
36
0
23 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
101
6
0
19 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
52
21
0
16 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Fuli Luo
Pengcheng Yang
Shicheng Li
Xuancheng Ren
Xu Sun
VLM
SSL
73
16
0
13 Oct 2020
Contrast and Classify: Training Robust VQA Models
Yash Kant
A. Moudgil
Dhruv Batra
Devi Parikh
Harsh Agrawal
55
5
0
13 Oct 2020
Beyond Language: Learning Commonsense from Images for Reasoning
Wanqing Cui
Yanyan Lan
Liang Pang
Jiafeng Guo
Xueqi Cheng
LRM
71
5
0
10 Oct 2020
Learning to Represent Image and Text with Denotation Graph
Bowen Zhang
Hexiang Hu
Vihan Jain
Eugene Ie
Fei Sha
78
22
0
06 Oct 2020
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
108
249
0
06 Oct 2020
Multi-Modal Open-Domain Dialogue
Kurt Shuster
Eric Michael Smith
Da Ju
Jason Weston
AI4CE
137
44
0
02 Oct 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
95
102
0
23 Sep 2020
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
OOD
56
142
0
18 Sep 2020
A Multimodal Memes Classification: A Survey and Open Research Issues
Tariq Habib Afridi
A. Alam
Muhammad Numan Khan
Jawad Khan
Young-Koo Lee
55
41
0
17 Sep 2020
Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models
Khyathi Chandu
Piyush Sharma
Soravit Changpinyo
Ashish V. Thapliyal
Radu Soricut
DiffM
VLM
84
3
0
10 Sep 2020
Active Contrastive Learning of Audio-Visual Video Representations
Shuang Ma
Zhaoyang Zeng
Daniel J. McDuff
Yale Song
VLM
SSL
60
8
0
31 Aug 2020
DeVLBert: Learning Deconfounded Visio-Linguistic Representations
Shengyu Zhang
Tan Jiang
Tan Wang
Kun Kuang
Zhou Zhao
Jianke Zhu
Jin Yu
Hongxia Yang
Leilei Gan
OOD
81
88
0
16 Aug 2020
Weakly supervised cross-domain alignment with optimal transport
Siyang Yuan
Ke Bai
Liqun Chen
Yizhe Zhang
Chenyang Tao
Chunyuan Li
Guoyin Wang
Ricardo Henao
Lawrence Carin
OT
60
7
0
14 Aug 2020
SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space
Liu Yang
VLM
60
5
0
02 Aug 2020
Spatially Aware Multimodal Transformers for TextVQA
Yash Kant
Dhruv Batra
Peter Anderson
Alex Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
100
86
0
23 Jul 2020
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Wanrong Zhu
Xinze Wang
Tsu-Jui Fu
An Yan
P. Narayana
Kazoo Sone
Sugato Basu
Wenjie Wang
93
34
0
01 Jul 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
128
382
0
30 Jun 2020
Video-Grounded Dialogues with Pretrained Generation Language Models
Hung Le
Guosheng Lin
82
28
0
27 Jun 2020
Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"
Saeed Amizadeh
Hamid Palangi
Oleksandr Polozov
Yichen Huang
K. Koishida
NAI
LRM
119
60
0
20 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjD
SSL
168
144
0
17 Jun 2020
VirTex: Learning Visual Representations from Textual Annotations
Karan Desai
Justin Johnson
SSL
VLM
171
437
0
11 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
133
501
0
11 Jun 2020
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
Minheng Ni
Haoyang Huang
Lin Su
Edward Cui
Taroon Bharti
Lijuan Wang
Jianfeng Gao
Dongdong Zhang
Nan Duan
56
7
0
04 Jun 2020
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
D. Gao
Linbo Jin
Ben Chen
Minghui Qiu
Peng Li
Yi Wei
Yitao Hu
Haozhe Jasper Wang
OOD
84
134
0
20 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
115
130
0
15 May 2020
Cross-media Structured Common Space for Multimedia Event Extraction
Manling Li
Alireza Zareian
Qi Zeng
Spencer Whitehead
Di Lu
Heng Ji
Shih-Fu Chang
80
103
0
05 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
133
507
0
01 May 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
193
236
0
30 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Yue Wang
Shafiq Joty
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
114
104
0
28 Apr 2020
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh
Vedanuj Goswami
Devi Parikh
VLM
78
48
0
19 Apr 2020
Relation Transformer Network
Rajat Koner
Poulami Sinhamahapatra
Volker Tresp
ViT
107
33
0
13 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
204
1,953
0
13 Apr 2020
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Yaobo Liang
Nan Duan
Yeyun Gong
Ning Wu
Fenfei Guo
...
Shuguang Liu
Fan Yang
Daniel Fernando Campos
Rangan Majumder
Ming Zhou
ELM
VLM
115
350
0
03 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
187
440
0
02 Apr 2020
Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA
VLM
395
1,500
0
18 Mar 2020
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
Hui Chen
Guiguang Ding
Xudong Liu
Zijia Lin
Ji Liu
Jungong Han
78
326
0
08 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM
VLM
103
76
0
03 Mar 2020
Unshuffling Data for Improved Generalization
Damien Teney
Ehsan Abbasnejad
Anton Van Den Hengel
OOD
77
78
0
27 Feb 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
59
6
0
25 Feb 2020
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
Weituo Hao
Chunyuan Li
Xiujun Li
Lawrence Carin
Jianfeng Gao
LM&Ro
119
282
0
25 Feb 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Botian Shi
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
130
438
0
15 Feb 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
127
263
0
22 Jan 2020
All-in-One Image-Grounded Conversational Agents
Da Ju
Kurt Shuster
Y-Lan Boureau
Jason Weston
LLMAG
85
8
0
28 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Vishvak Murahari
Dhruv Batra
Devi Parikh
Abhishek Das
VLM
111
117
0
05 Dec 2019
12-in-1: Multi-Task Vision and Language Representation Learning
Jiasen Lu
Vedanuj Goswami
Marcus Rohrbach
Devi Parikh
Stefan Lee
VLM
ObjD
131
481
0
05 Dec 2019
Previous
1
2
3
...
10
11
9
Next