1909.11059
Cited By
Unified Vision-Language Pre-Training for Image Captioning and VQA
24 September 2019
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
Papers citing
"Unified Vision-Language Pre-Training for Image Captioning and VQA"
50 / 523 papers shown
Title
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
27
0
0
30 Dec 2020
Detecting Hateful Memes Using a Multimodal Deep Ensemble
Vlad Sandulescu
VLM
31
43
0
24 Dec 2020
A Survey on Visual Transformer
Kai Han
Yunhe Wang
Hanting Chen
Xinghao Chen
Jianyuan Guo
...
Chunjing Xu
Yixing Xu
Zhaohui Yang
Yiman Zhang
Dacheng Tao
ViT
28
2,157
0
23 Dec 2020
Object-Centric Diagnosis of Visual Reasoning
Jianwei Yang
Jiayuan Mao
Jiajun Wu
Devi Parikh
David D. Cox
J. Tenenbaum
Chuang Gan
OCL
29
16
0
21 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
36
180
0
20 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
40
42
0
15 Dec 2020
MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
Lefei Zhang
Jianfeng Gao
Zicheng Liu
VLM
MLLM
40
60
0
13 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
38
142
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
39
76
0
08 Dec 2020
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da
Maxwell Forbes
Rowan Zellers
Anthony Zheng
Jena D. Hwang
Antoine Bosselut
Yejin Choi
DiffM
30
13
0
08 Dec 2020
Understanding Guided Image Captioning Performance across Domains
Edwin G. Ng
Bo Pang
P. Sharma
Radu Soricut
40
24
0
04 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
28
15
0
02 Dec 2020
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Emanuele Bugliarello
Ryan Cotterell
Naoaki Okazaki
Desmond Elliott
40
119
0
30 Nov 2020
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Hassan Akbari
Hamid Palangi
Jianwei Yang
Sudha Rao
Asli Celikyilmaz
Roland Fernandez
P. Smolensky
Jianfeng Gao
Shih-Fu Chang
49
3
0
18 Nov 2020
A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges
Moloud Abdar
Farhad Pourpanah
Sadiq Hussain
Dana Rezazadegan
Li Liu
...
Xiaochun Cao
Abbas Khosravi
U. Acharya
V. Makarenkov
S. Nahavandi
BDL
UQCV
74
1,888
0
12 Nov 2020
CapWAP: Captioning with a Purpose
Adam Fisch
Kenton Lee
Ming-Wei Chang
J. Clark
Regina Barzilay
27
11
0
09 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
Yue Wang
Jing Li
Michael R. Lyu
Irwin King
29
16
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
43
170
0
01 Nov 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li
Haoxuan You
Zhecan Wang
Alireza Zareian
Shih-Fu Chang
Kai-Wei Chang
SSL
VLM
72
12
0
24 Oct 2020
Show and Speak: Directly Synthesize Spoken Description of Images
Xinsheng Wang
Siyuan Feng
Jihua Zhu
M. Hasegawa-Johnson
O. Scharenborg
46
4
0
23 Oct 2020
Learning Dual Semantic Relations with Graph Attention for Image-Text Matching
Keyu Wen
Xiaodong Gu
Qingrong Cheng
32
95
0
22 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
36
6
0
19 Oct 2020
Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering
Hantao Huang
Tao Han
Wei Han
D. Yap
Cheng-Ming Chiang
21
2
0
17 Oct 2020
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović
Chandra Bhagavatula
J. S. Park
Ronan Le Bras
Noah A. Smith
Yejin Choi
ReLM
LRM
32
62
0
15 Oct 2020
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan
Joey Tianyi Zhou
CLIP
22
121
0
14 Oct 2020
MedICaT: A Dataset of Medical Images, Captions, and Textual References
Sanjay Subramanian
Lucy Lu Wang
Sachin Mehta
Ben Bogin
Madeleine van Zuylen
Sravanthi Parasa
Sameer Singh
Matt Gardner
Hannaneh Hajishirzi
MedIm
14
69
0
12 Oct 2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering
M. Farazi
Salman Khan
Nick Barnes
19
2
0
05 Oct 2020
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
Xiaowei Hu
Xi Yin
Kevin Qinghong Lin
Lijuan Wang
Lefei Zhang
Jianfeng Gao
Zicheng Liu
VLM
32
56
0
28 Sep 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
35
102
0
23 Sep 2020
A Multimodal Memes Classification: A Survey and Open Research Issues
Tariq Habib Afridi
A. Alam
Muhammad Numan Khan
Jawad Khan
Young-Koo Lee
34
37
0
17 Sep 2020
Multi-modal Summarization for Video-containing Documents
Xiyan Fu
Jun Wang
Zhenglu Yang
33
23
0
17 Sep 2020
Polysemy Deciphering Network for Robust Human-Object Interaction Detection
Xubin Zhong
Changxing Ding
X. Qu
Dacheng Tao
37
58
0
07 Aug 2020
Learning Visual Representations with Caption Annotations
Mert Bulent Sariyildiz
J. Perez
Diane Larlus
VLM
SSL
27
159
0
04 Aug 2020
Spatially Aware Multimodal Transformers for TextVQA
Yash Kant
Dhruv Batra
Peter Anderson
Alex Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
29
85
0
23 Jul 2020
Length-Controllable Image Captioning
Chaorui Deng
Ning Ding
Mingkui Tan
Qi Wu
VLM
38
56
0
19 Jul 2020
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Yingwei Pan
Yehao Li
Jianjie Luo
Jun Xu
Ting Yao
Tao Mei
43
57
0
05 Jul 2020
Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds
Jack Culpepper
Simão Herdade
Yang Zhang
K. Boakye
EgoV
26
72
0
30 Jun 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
36
377
0
30 Jun 2020
Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"
Saeed Amizadeh
Hamid Palangi
Oleksandr Polozov
Yichen Huang
K. Koishida
NAI
LRM
44
58
0
20 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjD
SSL
52
141
0
17 Jun 2020
VirTex: Learning Visual Representations from Textual Annotations
Karan Desai
Justin Johnson
SSL
VLM
35
433
0
11 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
35
489
0
11 Jun 2020
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
Minheng Ni
Haoyang Huang
Lin Su
Edward Cui
Taroon Bharti
Lijuan Wang
Jianfeng Gao
Dongdong Zhang
Nan Duan
29
7
0
04 Jun 2020
Emergent Multi-Agent Communication in the Deep Learning Era
Angeliki Lazaridou
Marco Baroni
AI4CE
53
199
0
03 Jun 2020
Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation
Boris Knyazev
H. D. Vries
Cătălina Cangea
Graham W. Taylor
Aaron Courville
Eugene Belilovsky
32
56
0
17 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
35
127
0
15 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
71
496
0
01 May 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
57
230
0
30 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Yue Wang
Shafiq Joty
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
34
102
0
28 Apr 2020
VisualCOMET: Reasoning about the Dynamic Context of a Still Image
J. S. Park
Chandra Bhagavatula
Roozbeh Mottaghi
Ali Farhadi
Yejin Choi
ReLM
LRM
27
6
0
22 Apr 2020