Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,094 papers shown
Title
Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?
Maurits J. R. Bleeker
Maarten de Rijke
SSL
DML
29
9
0
14 Feb 2022
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Jiaxi Gu
Xiaojun Meng
Guansong Lu
Lu Hou
Minzhe Niu
...
Runhu Huang
Wei Zhang
Xingda Jiang
Chunjing Xu
Hang Xu
VLM
53
88
0
14 Feb 2022
UserBERT: Modeling Long- and Short-Term User Preferences via Self-Supervision
Tianyu Li
Ali Cevahir
Derek Cho
Hao Gong
Duy Nguyen
B. Stenger
SSL
18
1
0
14 Feb 2022
Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?
Jiawen Zhang
Abhijit Mishra
Avinesh P.V.S
Siddharth Patwardhan
Sachin Agarwal
26
0
0
09 Feb 2022
Image Difference Captioning with Pre-training and Contrastive Learning
Linli Yao
Weiying Wang
Qin Jin
SSL
VLM
33
41
0
09 Feb 2022
Robotic Grasping from Classical to Modern: A Survey
Hanbo Zhang
Jian Tang
Shiguang Sun
Xuguang Lan
39
39
0
08 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
74
852
0
07 Feb 2022
Webly Supervised Concept Expansion for General Purpose Vision Models
Amita Kamath
Christopher Clark
Tanmay Gupta
Eric Kolve
Derek Hoiem
Aniruddha Kembhavi
VLM
37
54
0
04 Feb 2022
Pre-Trained Language Models for Interactive Decision-Making
Shuang Li
Xavier Puig
Chris Paxton
Yilun Du
Clinton Jia Wang
...
Anima Anandkumar
Jacob Andreas
Igor Mordatch
Antonio Torralba
Yuke Zhu
LM&Ro
53
250
0
03 Feb 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
25
16
0
29 Jan 2022
Kernelized Concept Erasure
Shauli Ravfogel
Francisco Vargas
Yoav Goldberg
Ryan Cotterell
29
32
0
28 Jan 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
397
4,185
0
28 Jan 2022
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Emanuele Bugliarello
Fangyu Liu
Jonas Pfeiffer
Siva Reddy
Desmond Elliott
Edoardo Ponti
Ivan Vulić
MLLM
VLM
ELM
55
62
0
27 Jan 2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Peixi Xiong
Yilin Shen
Hongxia Jin
24
5
0
25 Jan 2022
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering
Peixi Xiong
Quanzeng You
Pei Yu
Zicheng Liu
Ying Wu
24
5
0
25 Jan 2022
Text and Code Embeddings by Contrastive Pre-Training
Arvind Neelakantan
Tao Xu
Raul Puri
Alec Radford
Jesse Michael Han
...
Tabarak Khan
Toki Sherbakov
Joanne Jang
Peter Welinder
Lilian Weng
SSL
AI4TS
232
425
0
24 Jan 2022
Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis for Eyewear Devices
Yingying Zhao
Yuhu Chang
Yutian Lu
Yujiang Wang
Mingzhi Dong
...
Robert P. Dick
Fan Yang
Tun Lu
Ning Gu
L. Shang
41
9
0
24 Jan 2022
Learning to Act with Affordance-Aware Multimodal Neural SLAM
Zhiwei Jia
Kaixiang Lin
Yizhou Zhao
Qiaozi Gao
Govind Thattai
Gaurav Sukhatme
LM&Ro
36
15
0
24 Jan 2022
MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis
Georgios Paraskevopoulos
Efthymios Georgiou
Alexandros Potamianos
19
26
0
24 Jan 2022
Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding
Arjun Reddy Akula
OOD
23
3
0
24 Jan 2022
Supervised Visual Attention for Simultaneous Multimodal Machine Translation
Veneta Haralampieva
Ozan Caglayan
Lucia Specia
LRM
50
4
0
23 Jan 2022
A Pre-trained Audio-Visual Transformer for Emotion Recognition
Minh Tran
M. Soleymani
61
25
0
23 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
229
226
0
20 Jan 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
29
166
0
20 Jan 2022
CM3: A Causal Masked Multimodal Model of the Internet
Armen Aghajanyan
Po-Yao (Bernie) Huang
Candace Ross
Vladimir Karpukhin
Hu Xu
...
Dmytro Okhonko
Mandar Joshi
Gargi Ghosh
M. Lewis
Luke Zettlemoyer
22
155
0
19 Jan 2022
TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
Yue Ruan
Han-Hung Lee
Yiming Zhang
Ke Zhang
Angel X. Chang
37
22
0
19 Jan 2022
Generalizable Neuro-symbolic Systems for Commonsense Question Answering
A. Oltramari
Jonathan M Francis
Filip Ilievski
Kaixin Ma
Roshanak Mirzaee
NAI
23
8
0
17 Jan 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
29
103
0
16 Jan 2022
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering
Feng Gao
Q. Ping
Govind Thattai
Aishwarya N. Reganti
Yingting Wu
Premkumar Natarajan
18
17
0
14 Jan 2022
CLIP-Event: Connecting Text and Images with Event Structures
Manling Li
Ruochen Xu
Shuohang Wang
Luowei Zhou
Xudong Lin
Chenguang Zhu
Michael Zeng
Heng Ji
Shih-Fu Chang
VLM
CLIP
27
124
0
13 Jan 2022
Towards Automated Error Analysis: Learning to Characterize Errors
Tong Gao
Shivang Singh
Raymond J. Mooney
14
1
0
13 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
33
19
0
11 Jan 2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Ankur Sikarwar
Gabriel Kreiman
ViT
24
1
0
11 Jan 2022
Music2Video: Automatic Generation of Music Video with fusion of audio and text
Yoonjeon Kim
Joel Jang
Sumin Shin
DiffM
VGen
38
7
0
11 Jan 2022
DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population
Ningyu Zhang
Xin Xu
Li Tao
Haiyang Yu
Hongbin Ye
...
Qiang Chen
Feiyu Xiong
Fei Huang
Guozhou Zheng
Huajun Chen
VLM
HAI
40
42
0
10 Jan 2022
A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval
Zhixiong Zeng
Wenji Mao
VLM
21
17
0
08 Jan 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
48
207
0
07 Jan 2022
Self-Training Vision Language BERTs with a Unified Conditional Model
Xiaofeng Yang
Fengmao Lv
Fayao Liu
Guosheng Lin
SSL
VLM
54
13
0
06 Jan 2022
Automatic Related Work Generation: A Meta Study
Xiangci Li
Jessica Ouyang
36
10
0
06 Jan 2022
Discrete and continuous representations and processing in deep learning: Looking forward
Ruben Cartuyvels
Graham Spinks
Marie-Francine Moens
OCL
38
20
0
04 Jan 2022
Semantically Grounded Visual Embeddings for Zero-Shot Learning
Shah Nawaz
Jacopo Cavazza
Alessio Del Bue
ObjD
FedML
VLM
28
3
0
03 Jan 2022
Deconfounded Visual Grounding
Jianqiang Huang
Yu Qin
Jiaxin Qi
Qianru Sun
Hanwang Zhang
CML
ObjD
21
31
0
31 Dec 2021
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation
Han Zhang
Weichong Yin
Yewei Fang
Lanxin Li
Boqiang Duan
Zhihua Wu
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
32
59
0
31 Dec 2021
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
Mengde Xu
Zheng Zhang
Fangyun Wei
Yutong Lin
Yue Cao
Han Hu
Xiang Bai
VLM
25
212
0
29 Dec 2021
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision
Ajinkya Tejankar
Maziar Sanjabi
Bichen Wu
Saining Xie
Madian Khabsa
Hamed Pirsiavash
Hamed Firooz
VLM
36
17
0
27 Dec 2021
Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech
Gaoussou Youssouf Kebe
Luke E. Richards
Edward Raff
Francis Ferraro
Cynthia Matuszek
SSL
24
5
0
27 Dec 2021
LaTr: Layout-Aware Transformer for Scene-Text VQA
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
39
100
0
23 Dec 2021
Understanding and Measuring Robustness of Multimodal Learning
Nishant Vishwamitra
Hongxin Hu
Ziming Zhao
Long Cheng
Feng Luo
AAML
27
5
0
22 Dec 2021
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation
Xu Yan
Zhihao Yuan
Yuhao Du
Yinghong Liao
Yao Guo
Zhen Li
Shuguang Cui
3DPC
CoGe
23
14
0
22 Dec 2021
Hateful Memes Challenge: An Enhanced Multimodal Framework
Aijing Gao
Bingjun Wang
Jiaqi Yin
Yating Tian
21
2
0
20 Dec 2021
Previous
1
2
3
...
28
29
30
...
40
41
42
Next