Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1909.11059
Cited By
Unified Vision-Language Pre-Training for Image Captioning and VQA
24 September 2019
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Unified Vision-Language Pre-Training for Image Captioning and VQA"
50 / 523 papers shown
Title
BERTGEN: Multi-task Generation through BERT
Faidon Mitzalis
Ozan Caglayan
Pranava Madhyastha
Lucia Specia
VLM
27
7
0
07 Jun 2021
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Haiyang Xu
Ming Yan
Chenliang Li
Bin Bi
Songfang Huang
Wenming Xiao
Fei Huang
VLM
31
118
0
03 Jun 2021
M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers
Zhu Zhang
Jianxin Ma
Chang Zhou
Rui Men
Zhikang Li
Ming Ding
Jie Tang
Jingren Zhou
Hongxia Yang
34
46
0
29 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
34
153
0
24 May 2021
Pretrained Language Models for Text Generation: A Survey
Junyi Li
Tianyi Tang
Wayne Xin Zhao
Ji-Rong Wen
LM&MA
VLM
SyDa
30
185
0
21 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
31
132
0
20 May 2021
Show Why the Answer is Correct! Towards Explainable AI using Compositional Temporal Attention
Nihar Bendre
K. Desai
Peyman Najafirad
CoGe
41
6
0
15 May 2021
Connecting What to Say With Where to Look by Modeling Human Attention Traces
Zihang Meng
Licheng Yu
Ning Zhang
Tamara L. Berg
Babak Damavandi
Vikas Singh
Amy Bearman
47
25
0
12 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Min Zhang
73
273
0
10 May 2021
Teaching a Massive Open Online Course on Natural Language Processing
Ekaterina Artemova
M. Apishev
V. Sarkisyan
Sergey Aksenov
D. Kirjanov
O. Serikov
VLM
19
4
0
26 Apr 2021
MusCaps: Generating Captions for Music Audio
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
64
36
0
24 Apr 2021
Playing Lottery Tickets with Vision and Language
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
114
54
0
23 Apr 2021
Understanding Synonymous Referring Expressions via Contrastive Features
Yi-Wen Chen
Yi-Hsuan Tsai
Ming-Hsuan Yang
ObjD
32
4
0
20 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
35
24
0
20 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
56
271
0
07 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
46
99
0
05 Apr 2021
Towards General Purpose Vision Systems
Tanmay Gupta
Amita Kamath
Aniruddha Kembhavi
Derek Hoiem
20
50
0
01 Apr 2021
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Mingyang Zhou
Luowei Zhou
Shuohang Wang
Yu Cheng
Linjie Li
Zhou Yu
Jingjing Liu
MLLM
VLM
36
89
0
01 Apr 2021
CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning
Luowei Zhou
Jingjing Liu
Yu Cheng
Zhe Gan
Lei Zhang
31
7
0
01 Apr 2021
A Survey on Natural Language Video Localization
Xinfang Liu
Xiushan Nie
Zhifang Tan
Jie Guo
Yilong Yin
50
7
0
01 Apr 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
25
137
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
35
120
0
30 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
32
331
0
29 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
33
143
0
23 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
46
83
0
16 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
Andrew Shin
Masato Ishii
T. Narihira
40
38
0
06 Mar 2021
Causal Attention for Vision-Language Tasks
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
36
152
0
05 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
221
312
0
02 Mar 2021
CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation
A. Magassouba
K. Sugiura
Hisashi Kawai
55
10
0
01 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Wei Lin
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
37
133
0
01 Mar 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go
Arnav Arora
Preslav Nakov
Momchil Hardalov
Sheikh Muhammad Sarwar
Vibha Nayak
...
Dimitrina Zlatkova
Kyle Dent
Ameya Bhatawdekar
Guillaume Bouchard
Isabelle Augenstein
43
47
0
27 Feb 2021
Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
Hao Zhang
Aixin Sun
Wei Jing
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
121
85
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
35
297
0
22 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
323
1,096
0
17 Feb 2021
Speech-language Pre-training for End-to-end Spoken Language Understanding
Yao Qian
Ximo Bian
Yu Shi
Naoyuki Kanda
Leo Shen
Zhen Xiao
Michael Zeng
AuLLM
32
46
0
11 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
66
650
0
11 Feb 2021
Iconographic Image Captioning for Artworks
E. Cetinic
32
24
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
77
1,719
0
05 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho
Jie Lei
Hao Tan
Joey Tianyi Zhou
MLLM
277
529
0
04 Feb 2021
An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Alessandro Suglia
Yonatan Bisk
Ioannis Konstas
Antonio Vergari
E. Bastianelli
Andrea Vanzo
Oliver Lemon
26
8
0
31 Jan 2021
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Xudong Lin
Gedas Bertasius
Jue Wang
Shih-Fu Chang
Devi Parikh
Lorenzo Torresani
VGen
38
66
0
28 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
29
52
0
27 Jan 2021
VisualMRC: Machine Reading Comprehension on Document Images
Ryota Tanaka
Kyosuke Nishida
Sen Yoshida
34
141
0
27 Jan 2021
Dual-Level Collaborative Transformer for Image Captioning
Yunpeng Luo
Jiayi Ji
Xiaoshuai Sun
Liujuan Cao
Yongjian Wu
Feiyue Huang
Chia-Wen Lin
Rongrong Ji
ViT
19
274
0
16 Jan 2021
Latent Variable Models for Visual Question Answering
Zixu Wang
Yishu Miao
Lucia Specia
32
5
0
16 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
11
28
0
15 Jan 2021
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
233
2,445
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD
VLM
272
157
0
02 Jan 2021
KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation
Yiran Xing
Z. Shi
Zhao Meng
Gerhard Lakemeyer
Yunpu Ma
Roger Wattenhofer
VLM
74
40
0
02 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
31
375
0
31 Dec 2020
Previous
1
2
3
...
10
11
8
9
Next