Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,118 papers shown
Title
Multimodal Categorization of Crisis Events in Social Media
Mahdi Abavisani
Liwei Wu
Shengli Hu
Joel R. Tetreault
A. Jaimes
98
88
0
10 Apr 2020
Learning to Scale Multilingual Representations for Vision-Language Tasks
Andrea Burns
Donghyun Kim
Derry Wijaya
Kate Saenko
Bryan A. Plummer
50
35
0
09 Apr 2020
Context-Aware Group Captioning via Self-Attention and Contrastive Features
Zhuowan Li
Quan Hung Tran
Long Mai
Zhe Lin
Alan Yuille
VLM
81
44
0
07 Apr 2020
TAPAS: Weakly Supervised Table Parsing via Pre-training
Jonathan Herzig
Pawel Krzysztof Nowak
Thomas Müller
Francesco Piccinno
Julian Martin Eisenschlos
LMTD
RALM
176
658
0
05 Apr 2020
Generating Rationales in Visual Question Answering
Hammad A. Ayyubi
Md. Mehrab Tanjim
Julian McAuley
G. Cottrell
LRM
47
6
0
04 Apr 2020
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Yaobo Liang
Nan Duan
Yeyun Gong
Ning Wu
Fenfei Guo
...
Shuguang Liu
Fan Yang
Daniel Fernando Campos
Rangan Majumder
Ming Zhou
ELM
VLM
115
351
0
03 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
197
440
0
02 Apr 2020
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
J. Liu
Wenhu Chen
Yu Cheng
Zhe Gan
Licheng Yu
Yiming Yang
Jingjing Liu
MLLM
VGen
104
70
0
25 Mar 2020
Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA
VLM
422
1,500
0
18 Mar 2020
Deconfounded Image Captioning: A Causal Retrospect
Xu Yang
Hanwang Zhang
Jianfei Cai
CML
79
127
0
09 Mar 2020
Cross-modal Learning for Multi-modal Video Categorization
Palash Goyal
Saurabh Sahu
Shalini Ghosh
Chul Lee
79
9
0
07 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM
VLM
105
76
0
03 Mar 2020
Visual Commonsense R-CNN
Tan Wang
Jianqiang Huang
Hanwang Zhang
Qianru Sun
SSL
ObjD
CML
86
252
0
27 Feb 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
59
6
0
25 Feb 2020
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
Weituo Hao
Chunyuan Li
Xiujun Li
Lawrence Carin
Jianfeng Gao
LM&Ro
125
283
0
25 Feb 2020
Measuring Social Biases in Grounded Vision and Language Embeddings
Candace Ross
Boris Katz
Andrei Barbu
103
65
0
20 Feb 2020
Contextual Lensing of Universal Sentence Representations
J. Kiros
57
5
0
20 Feb 2020
VQA-LOL: Visual Question Answering under the Lens of Logic
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
CoGe
71
75
0
19 Feb 2020
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Zhangyin Feng
Daya Guo
Duyu Tang
Nan Duan
Xiaocheng Feng
...
Linjun Shou
Bing Qin
Ting Liu
Daxin Jiang
Ming Zhou
230
2,727
0
19 Feb 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Botian Shi
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
132
438
0
15 Feb 2020
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
Jesse Dodge
Gabriel Ilharco
Roy Schwartz
Ali Farhadi
Hannaneh Hajishirzi
Noah A. Smith
107
598
0
15 Feb 2020
Exploiting Temporal Coherence for Multi-modal Video Categorization
Palash Goyal
Saurabh Sahu
Shalini Ghosh
Chul Lee
36
1
0
07 Feb 2020
Retrospective Reader for Machine Reading Comprehension
Zhuosheng Zhang
Junjie Yang
Hai Zhao
RALM
104
227
0
27 Jan 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
134
263
0
22 Jan 2020
Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
M. Farazi
Salman H. Khan
Nick Barnes
79
18
0
20 Jan 2020
In Defense of Grid Features for Visual Question Answering
Huaizu Jiang
Ishan Misra
Marcus Rohrbach
Erik Learned-Miller
Xinlei Chen
OOD
ObjD
88
320
0
10 Jan 2020
Visual Question Answering on 360° Images
Shih-Han Chou
Wei-Lun Chao
Wei-Sheng Lai
Min Sun
Ming-Hsuan Yang
54
22
0
10 Jan 2020
Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering
Lei Shi
Shijie Geng
Kai Shuang
Chiori Hori
Songxiang Liu
Peng Gao
Sen Su
85
11
0
03 Jan 2020
All-in-One Image-Grounded Conversational Agents
Da Ju
Kurt Shuster
Y-Lan Boureau
Jason Weston
LLMAG
85
8
0
28 Dec 2019
Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
Sara Beery
Guanhang Wu
V. Rathod
Ronny Votel
Jonathan Huang
ObjD
109
116
0
07 Dec 2019
Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
Corentin Kervadec
G. Antipov
M. Baccouche
Christian Wolf
60
15
0
06 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Vishvak Murahari
Dhruv Batra
Devi Parikh
Abhishek Das
VLM
115
117
0
05 Dec 2019
15 Keypoints Is All You Need
Michael Snower
Asim Kadav
Farley Lai
H. Graf
VOT
3DH
96
48
0
05 Dec 2019
12-in-1: Multi-Task Vision and Language Representation Learning
Jiasen Lu
Vedanuj Goswami
Marcus Rohrbach
Devi Parikh
Stefan Lee
VLM
ObjD
131
481
0
05 Dec 2019
Multimodal Attention Networks for Low-Level Vision-and-Language Navigation
Federico Landi
Lorenzo Baraldi
Marcella Cornia
M. Corsini
Rita Cucchiara
LM&Ro
87
29
0
27 Nov 2019
Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs
Van-Quang Nguyen
Masanori Suganuma
Takayuki Okatani
107
7
0
26 Nov 2019
Learning to Learn Words from Visual Scenes
Dídac Surís
Dave Epstein
Heng Ji
Shih-Fu Chang
Carl Vondrick
VLM
CLIP
SSL
OffRL
72
4
0
25 Nov 2019
Temporal Reasoning via Audio Question Answering
Haytham M. Fayek
Justin Johnson
65
54
0
21 Nov 2019
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
Fengda Zhu
Yi Zhu
Xiaojun Chang
Xiaodan Liang
LRM
115
244
0
18 Nov 2019
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Ronghang Hu
Amanpreet Singh
Trevor Darrell
Marcus Rohrbach
98
197
0
14 Nov 2019
Unsupervised Pre-training for Natural Language Generation: A Literature Review
Yuanxin Liu
Zheng Lin
SSL
AI4CE
45
3
0
13 Nov 2019
The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
J. Dean
60
79
0
13 Nov 2019
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
Chao Zhang
Zichao Yang
Xiaodong He
Li Deng
HAI
AI4TS
122
338
0
10 Nov 2019
Two-Headed Monster And Crossed Co-Attention Networks
Yaoyiran Li
Jing Jiang
64
0
0
10 Nov 2019
The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
Kurt Shuster
Da Ju
Stephen Roller
Emily Dinan
Y-Lan Boureau
Jason Weston
112
82
0
09 Nov 2019
Probing Contextualized Sentence Representations with Visual Awareness
Zhuosheng Zhang
Rui Wang
Kehai Chen
Masao Utiyama
Eiichiro Sumita
Hai Zhao
77
2
0
07 Nov 2019
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
Jingxiang Lin
Unnat Jain
Alex Schwing
LRM
ReLM
105
9
0
31 Oct 2019
Text Matters but Speech Influences: A Computational Analysis of Syntactic Ambiguity Resolution
Won Ik Cho
J. Cho
Woohyun Kang
N. Kim
88
2
0
21 Oct 2019
Meta Module Network for Compositional Visual Reasoning
Wenhu Chen
Zhe Gan
Linjie Li
Yu Cheng
Wenjie Wang
Jingjing Liu
LRM
93
71
0
08 Oct 2019
LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval
Reuben Tan
Huijuan Xu
Kate Saenko
Bryan A. Plummer
99
68
0
27 Sep 2019
Previous
1
2
3
...
41
42
43
Next