Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1909.11059
Cited By
Unified Vision-Language Pre-Training for Image Captioning and VQA
24 September 2019
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Unified Vision-Language Pre-Training for Image Captioning and VQA"
50 / 523 papers shown
Title
Learning Granularity-Unified Representations for Text-to-Image Person Re-identification
Zhiyin Shao
Xinyu Zhang
Meng Fang
Zhi-hao Lin
Jian Wang
Changxing Ding
44
99
0
16 Jul 2022
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang
F. Xia
Ted Xiao
Harris Chan
Jacky Liang
...
Tomas Jackson
Linda Luu
Sergey Levine
Karol Hausman
Brian Ichter
LLMAG
LM&Ro
LRM
51
873
0
12 Jul 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLM
CLIP
32
2
0
05 Jul 2022
ZoDIAC: Zoneout Dropout Injection Attention Calculation
Zanyar Zohourianshahzadi
Jugal Kalita
41
0
0
28 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
51
64
0
17 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
66
230
0
16 Jun 2022
Multimodal Dialogue State Tracking
Hung Le
Nancy F. Chen
Guosheng Lin
35
9
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
54
124
0
15 Jun 2022
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
Tuan Dinh
Yuchen Zeng
Ruisu Zhang
Ziqian Lin
Michael Gira
Shashank Rajput
Jy-yong Sohn
Dimitris Papailiopoulos
Kangwook Lee
LMTD
69
130
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
Peng Xu
Xiatian Zhu
David Clifton
ViT
82
538
0
13 Jun 2022
Language Models are General-Purpose Interfaces
Y. Hao
Haoyu Song
Li Dong
Shaohan Huang
Zewen Chi
Wenhui Wang
Shuming Ma
Furu Wei
MLLM
35
96
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD
VLM
30
292
0
12 Jun 2022
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
Zhihao Fan
Zhongyu Wei
Jingjing Chen
Siyuan Wang
Zejun Li
Jiarong Xu
Xuanjing Huang
CLL
22
6
0
11 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe
MoE
39
66
0
09 Jun 2022
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
Kshitij Gupta
Devansh Gautam
R. Mamidi
VLM
41
3
0
07 Jun 2022
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
Xudong Lin
Simran Tiwari
Shiyuan Huang
Manling Li
Mike Zheng Shou
Heng Ji
Shih-Fu Chang
42
20
0
05 Jun 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Yujia Xie
Luowei Zhou
Xiyang Dai
Lu Yuan
Nguyen Bach
Ce Liu
Michael Zeng
VLM
MLLM
37
28
0
03 Jun 2022
CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo
Tong Wang
Jorma T. Laaksonen
VLM
34
28
0
01 Jun 2022
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu
Chengquan Zhang
Shanshan Liu
Meina Qiao
Yangliu Xu
Liang Wu
Kun Yao
Junyu Han
Errui Ding
Jingdong Wang
54
42
0
01 Jun 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
68
532
0
27 May 2022
Multimodal Knowledge Alignment with Reinforcement Learning
Youngjae Yu
Jiwan Chung
Heeseung Yun
Jack Hessel
Jinho Park
...
Prithviraj Ammanabrolu
Rowan Zellers
Ronan Le Bras
Gunhee Kim
Yejin Choi
VLM
123
36
0
25 May 2022
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation
Jingnong Qu
Liunian Harold Li
Jieyu Zhao
Sunipa Dev
Kai-Wei Chang
26
12
0
25 May 2022
ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
Paul Baltescu
Haoyu Chen
Nikil Pancha
Andrew Zhai
J. Leskovec
Charles R. Rosenberg
DML
45
33
0
24 May 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
50
4
0
24 May 2022
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Zhenhailong Wang
Manling Li
Ruochen Xu
Luowei Zhou
Jie Lei
...
Chenguang Zhu
Derek Hoiem
Shih-Fu Chang
Joey Tianyi Zhou
Heng Ji
MLLM
VLM
172
138
0
22 May 2022
Novel Multicolumn Kernel Extreme Learning Machine for Food Detection via Optimal Features from CNN
G. Tahir
Tahir Chu
C. K. Loo
29
2
0
15 May 2022
Localized Vision-Language Matching for Open-vocabulary Object Detection
M. A. Bravo
Sudhanshu Mittal
Thomas Brox
VLM
ObjD
20
25
0
12 May 2022
Automated Audio Captioning: An Overview of Recent Progress and New Challenges
Xinhao Mei
Xubo Liu
Mark D. Plumbley
Wenwu Wang
49
38
0
12 May 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
48
34
0
10 May 2022
RoViST:Learning Robust Metrics for Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
30
8
0
08 May 2022
From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data
Zhenghang Yuan
Lichao Mou
Q. Wang
Xiao Xiang Zhu
32
62
0
06 May 2022
Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su
Tian Lan
Yahui Liu
Fangyu Liu
Dani Yogatama
Yan Wang
Lingpeng Kong
Nigel Collier
VLM
MLLM
64
97
0
05 May 2022
Cross-modal Contrastive Learning for Speech Translation
Rong Ye
Mingxuan Wang
Lei Li
SSL
36
84
0
05 May 2022
All You May Need for VQA are Image Captions
Soravit Changpinyo
Doron Kukliansky
Idan Szpektor
Xi Chen
Nan Ding
Radu Soricut
32
71
0
04 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
22
16
0
02 May 2022
Molecular Identification from AFM images using the IUPAC Nomenclature and Attribute Multimodal Recurrent Neural Networks
Jaime Carracedo-Cosme
C. Romero-Muñiz
Pablo Pou
R. Pérez
27
10
0
01 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
63
3,403
0
29 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
26
9
0
23 Apr 2022
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Zhecan Wang
Noel Codella
Yen-Chun Chen
Luowei Zhou
Xiyang Dai
...
Jianwei Yang
Haoxuan You
Kai-Wei Chang
Shih-Fu Chang
Lu Yuan
VLM
OffRL
35
22
0
22 Apr 2022
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Mustafa Shukor
Guillaume Couairon
Asya Grechka
Matthieu Cord
ViT
52
18
0
20 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
22
43
0
16 Apr 2022
Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer
Hyungyu Lee
Sungjin Park
Joonseok Lee
Edward Choi
35
2
0
15 Apr 2022
Guiding Attention using Partial-Order Relationships for Image Captioning
Murad Popattia
Muhammad Rafi
Rizwan Qureshi
Shah Nawaz
21
5
0
15 Apr 2022
Image Captioning In the Transformer Age
Yangliu Xu
Li Li
Haiyang Xu
Songfang Huang
Fei Huang
Jianfei Cai
ViT
27
5
0
15 Apr 2022
Vision-and-Language Pretrained Models: A Survey
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
43
63
0
15 Apr 2022
Temporal Alignment Networks for Long-term Video
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
29
85
0
06 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
24
17
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
35
94
0
30 Mar 2022
Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers
A. Bucker
Luis F. C. Figueredo
Sami Haddadin
Ashish Kapoor
Shuang Ma
Rogerio Bonatti
LM&Ro
47
49
0
25 Mar 2022
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation
Qingkai Fang
Rong Ye
Lei Li
Yang Feng
Mingxuan Wang
55
95
0
20 Mar 2022
Previous
1
2
3
...
5
6
7
...
9
10
11
Next