1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,119 papers shown
Title
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
DongDong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
109
49
0
03 May 2022
Detection of Propaganda Techniques in Visuo-Lingual Metaphor in Memes
Sunil Gundapu
R. Mamidi
62
2
0
03 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
79
16
0
02 May 2022
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
Cheng Chen
Yudong Zhu
Zhenshan Tan
Qingrong Cheng
Xin Jiang
Qun Liu
X. Gu
77
39
0
01 May 2022
Visual Spatial Reasoning
Fangyu Liu
Guy Edward Toh Emerson
Nigel Collier
ReLM
133
185
0
30 Apr 2022
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Li Yang
Yan Xu
Chunfen Yuan
Wei Liu
Bing Li
Weiming Hu
ObjD
92
119
0
30 Apr 2022
Approximating Permutations with Neural Network Components for Travelling Photographer Problem
S. Chong
71
0
0
30 Apr 2022
End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
Chenyu You
Nuo Chen
Fenglin Liu
Shen Ge
Xian Wu
Yuexian Zou
AuLLM
63
44
0
29 Apr 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
434
3,621
0
29 Apr 2022
PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Yuting Gao
Jinfeng Liu
Zihan Xu
Jinchao Zhang
Ke Li
Rongrong Ji
Chunhua Shen
VLM
CLIP
138
105
0
29 Apr 2022
Where in the World is this Image? Transformer-based Geo-localization in the Wild
Shraman Pramanick
E. Nowara
Joshua Gleason
Carlos D. Castillo
Rama Chellappa
ViT
62
37
0
29 Apr 2022
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Spencer Whitehead
Suzanne Petryk
Vedaad Shakib
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
116
56
0
28 Apr 2022
Reducing Predictive Feature Suppression in Resource-Constrained Contrastive Image-Caption Retrieval
Maurits J. R. Bleeker
Andrew Yates
Maarten de Rijke
100
4
0
28 Apr 2022
CapOnImage: Context-driven Dense-Captioning on Image
Yiqi Gao
Xinglin Hou
Yuanmeng Zhang
T. Ge
Yuning Jiang
Peifeng Wang
139
11
0
27 Apr 2022
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Pinaki Nath Chowdhury
A. Bhunia
Aneeshan Sain
Subhadeep Koley
Tao Xiang
Yi-Zhe Song
104
30
0
25 Apr 2022
Translation between Molecules and Natural Language
Carl Edwards
T. Lai
Kevin Ros
Garrett Honke
Kyunghyun Cho
Heng Ji
139
173
0
25 Apr 2022
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
Yida Zhao
Yuqing Song
Qin Jin
82
29
0
24 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
77
9
0
23 Apr 2022
Unified Pretraining Framework for Document Understanding
Jiuxiang Gu
Jason Kuen
Vlad I. Morariu
Handong Zhao
Nikolaos Barmpalios
R. Jain
A. Nenkova
Tong Sun
105
98
0
22 Apr 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
97
2
0
22 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
82
10
0
21 Apr 2022
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
Zhiqiang Yuan
Wenkai Zhang
Kun Fu
Xuan Li
Chubo Deng
Hongqi Wang
Xian Sun
101
140
0
21 Apr 2022
Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing
Benedikt Boecking
Naoto Usuyama
Shruthi Bannur
Daniel Coelho De Castro
Anton Schwaighofer
...
Tristan Naumann
A. Nori
Javier Alvarez-Valle
Hoifung Poon
Ozan Oktay
93
247
0
21 Apr 2022
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Mustafa Shukor
Guillaume Couairon
Asya Grechka
Matthieu Cord
ViT
89
19
0
20 Apr 2022
Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations
Leila Pishdad
Ran Zhang
Konstantinos G. Derpanis
Allan D. Jepson
Afsaneh Fazly
41
2
0
20 Apr 2022
K-LITE: Learning Transferable Visual Models with External Knowledge
Sheng Shen
Chunyuan Li
Xiaowei Hu
Jianwei Yang
Yujia Xie
...
Ce Liu
Kurt Keutzer
Trevor Darrell
Anna Rohrbach
Jianfeng Gao
CLIP
VLM
72
85
0
20 Apr 2022
Imagination-Augmented Natural Language Understanding
Yujie Lu
Wanrong Zhu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
62
24
0
18 Apr 2022
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
Evonne Ng
Hanbyul Joo
Liwen Hu
Hao Li
Trevor Darrell
Angjoo Kanazawa
Shiry Ginosar
VGen
78
95
0
18 Apr 2022
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Yupan Huang
Tengchao Lv
Lei Cui
Yutong Lu
Furu Wei
171
464
0
18 Apr 2022
Visio-Linguistic Brain Encoding
Subba Reddy Oota
Jashn Arora
Vijay Rowtula
Manish Gupta
R. Bapi
AI4CE
53
18
0
18 Apr 2022
OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval
Yunhao Du
Binyu Zhang
Xiang Ruan
Fei Su
Zhicheng Zhao
Hong Chen
58
5
0
18 Apr 2022
End-to-end Dense Video Captioning as Sequence Generation
Wanrong Zhu
Bo Pang
Ashish V. Thapliyal
William Yang Wang
Radu Soricut
DiffM
61
34
0
18 Apr 2022
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
Gen Luo
Yiyi Zhou
Jiamu Sun
Xiaoshuai Sun
Rongrong Ji
ObjD
78
10
0
17 Apr 2022
Attention Mechanism based Cognition-level Scene Understanding
Xuejiao Tang
Tai Le Quy
LRM
84
0
0
17 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
64
47
0
16 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
100
55
0
15 Apr 2022
Vision-and-Language Pretrained Models: A Survey
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
111
65
0
15 Apr 2022
XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
Chan-Jan Hsu
Hung-yi Lee
Yu Tsao
VLM
44
3
0
15 Apr 2022
Brainish: Formalizing A Multimodal Language for Intelligence and Consciousness
Paul Pu Liang
78
6
0
14 Apr 2022
TIB-VA at SemEval-2022 Task 5: A Multimodal Architecture for the Detection and Classification of Misogynous Memes
Sherzod Hakimov
Gullal Singh Cheema
Ralph Ewerth
37
10
0
13 Apr 2022
What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data
Oier Mees
Lukás Hermann
Wolfram Burgard
LM&Ro
117
156
0
13 Apr 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Sanjay Subramanian
William Merrill
Trevor Darrell
Matt Gardner
Sameer Singh
Anna Rohrbach
ObjD
116
128
0
12 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD
MLLM
VLM
72
50
0
12 Apr 2022
How does fake news use a thumbnail? CLIP-based Multimodal Detection on the Unrepresentative News Image
H. Choi
Yejun Yoon
Seunghyun Yoon
Kunwoo Park
52
8
0
12 Apr 2022
Are Multimodal Transformers Robust to Missing Modality?
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
117
155
0
12 Apr 2022
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
Shunyu Zhang
X. Jiang
Zequn Yang
T. Wan
Zengchang Qin
64
12
0
10 Apr 2022
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
Yunxing Kang
Tianqiao Liu
Hang Li
Y. Hao
Wenbiao Ding
78
8
0
10 Apr 2022
On Distinctive Image Captioning via Comparing and Reweighting
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
91
16
0
08 Apr 2022
Unified Contrastive Learning in Image-Text-Label Space
Jianwei Yang
Chunyuan Li
Pengchuan Zhang
Bin Xiao
Ce Liu
Lu Yuan
Jianfeng Gao
VLM
SSL
176
227
0
07 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
168
429
0
07 Apr 2022