1908.02265
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,119 papers shown
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection
Jiaqi Tang
Zhaoyang Liu
Chao Qian
Wayne Wu
Limin Wang
100
18
0
09 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
159
719
0
08 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
66
6
0
08 Dec 2021
Learning music audio representations via weak language supervision
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
161
33
0
08 Dec 2021
Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
Xiwen Liang
Fengda Zhu
Yi Zhu
Bingqian Lin
Bing Wang
Xiaodan Liang
82
23
0
08 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD
VLM
189
1,073
0
07 Dec 2021
Cross-domain User Preference Learning for Cold-start Recommendation
Huiling Zhou
Jie Liu
Zhikang Li
Jin Yu
Hongxia Yang
59
0
0
07 Dec 2021
CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
Huidong Liu
Shaoyuan Xu
Jinmiao Fu
Yang Liu
Ning Xie
Chien Wang
Bryan Wang
Yi Sun
CLIP
VLM
72
29
0
07 Dec 2021
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees
Lukás Hermann
Erick Rosete-Beas
Wolfram Burgard
LM&Ro
148
263
0
06 Dec 2021
General Facial Representation Learning in a Visual-Linguistic Manner
Yinglin Zheng
Hao Yang
Ting Zhang
Jianmin Bao
Dongdong Chen
Yangyu Huang
Lu Yuan
Dong Chen
Ming Zeng
Fang Wen
CVBM
220
176
0
06 Dec 2021
Joint Learning of Localized Representations from Medical Images and Reports
Philipp Muller
Georgios Kaissis
Cong Zou
Daniel Munich
223
87
0
06 Dec 2021
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Longtian Qiu
Renrui Zhang
Ziyu Guo
Wei Zhang
Zilu Guo
Ziyao Zeng
Guangnan Zhang
VLM
CLIP
89
45
0
04 Dec 2021
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang
Jiaqi Wang
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Philip Torr
228
333
0
04 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
132
133
0
02 Dec 2021
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Yongming Rao
Wenliang Zhao
Guangyi Chen
Yansong Tang
Zheng Zhu
Guan Huang
Jie Zhou
Jiwen Lu
VLM
CLIP
232
584
0
02 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
103
24
0
02 Dec 2021
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
Christopher Clark
Jordi Salvador
Dustin Schwenk
Derrick Bonafilia
Mark Yatskar
...
Aaron Sarnat
Hannaneh Hajishirzi
Aniruddha Kembhavi
Oren Etzioni
Ali Farhadi
MLLM
57
5
0
01 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
91
82
0
01 Dec 2021
SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing
Jing Shi
Ning Xu
Haitian Zheng
Alex Smith
Jiebo Luo
Chenliang Xu
KELM
DiffM
134
4
0
30 Nov 2021
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
Stan Weixian Lei
Difei Gao
Yuxuan Wang
Dongxing Mao
Zihan Liang
L. Ran
Mike Zheng Shou
80
8
0
30 Nov 2021
Classification-Regression for Chart Comprehension
Matan Levy
Rami Ben-Ari
Dani Lischinski
67
16
0
29 Nov 2021
Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method
Wenda Qin
Teruhisa Misu
Derry Wijaya
UQCV
LM&Ro
83
5
0
28 Nov 2021
LAFITE: Towards Language-Free Training for Text-to-Image Generation
Yufan Zhou
Ruiyi Zhang
Changyou Chen
Chunyuan Li
Chris Tensmeyer
Tong Yu
Jiuxiang Gu
Jinhui Xu
Tong Sun
VLM
105
168
0
27 Nov 2021
VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
Changyao Tian
Wenhai Wang
Xizhou Zhu
Jifeng Dai
Yu Qiao
VLM
107
73
0
26 Nov 2021
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics
Aiham Taleb
Matthias Kirchler
Remo Monti
C. Lippert
SSL
MedIm
75
58
0
26 Nov 2021
Neural Collaborative Graph Machines for Table Structure Recognition
Hao Liu
Xin Li
Bin Liu
Deqiang Jiang
Yinsong Liu
Bo Ren
LMTD
113
33
0
26 Nov 2021
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Zipeng Xu
Tianwei Lin
Hao Tang
Fu Li
Dongliang He
N. Sebe
Radu Timofte
Luc Van Gool
Errui Ding
EGVM
99
43
0
26 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
110
75
0
25 Nov 2021
Layered Controllable Video Generation
Jiahui Huang
Yuhe Jin
K. M. Yi
Leonid Sigal
VGen
81
11
0
24 Nov 2021
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
109
12
0
24 Nov 2021
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling
Dat T. Huynh
Jason Kuen
Zhe Lin
Jiuxiang Gu
Ehsan Elhamifar
ISeg
VLM
109
86
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
178
249
0
24 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
149
117
0
23 Nov 2021
Building Goal-Oriented Dialogue Systems with Situated Visual Context
Sanchit Agarwal
Jan Jezabek
Arijit Biswas
Emre Barut
Shuyang Gao
Tagyoung Chung
53
1
0
22 Nov 2021
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
137
169
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming-Hsuan Yang
173
97
0
22 Nov 2021
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning
Keng Ji Chow
Samson Tan
MingSung Kan
LRM
65
4
0
21 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
105
4
0
19 Nov 2021
Combined Scaling for Zero-shot Transfer Learning
Hieu H. Pham
Zihang Dai
Golnaz Ghiasi
Kenji Kawaguchi
Hanxiao Liu
...
Yi-Ting Chen
Minh-Thang Luong
Yonghui Wu
Mingxing Tan
Quoc V. Le
VLM
140
202
0
19 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
78
57
0
19 Nov 2021
ClipCap: CLIP Prefix for Image Captioning
Ron Mokady
Amir Hertz
Amit H. Bermano
CLIP
VLM
85
684
0
18 Nov 2021
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
Yaya Shi
Xu Yang
Haiyang Xu
Chunfen Yuan
Bing Li
Weiming Hu
Zhengjun Zha
82
33
0
17 Nov 2021
Achieving Human Parity on Visual Question Answering
Ming Yan
Haiyang Xu
Chenliang Li
Junfeng Tian
Bin Bi
...
Ji Zhang
Songfang Huang
Fei Huang
Luo Si
Rong Jin
63
13
0
17 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Yan Zeng
Xinsong Zhang
Hang Li
VLM
CLIP
100
308
0
16 Nov 2021
LiT: Zero-Shot Transfer with Locked-image text Tuning
Xiaohua Zhai
Tianlin Li
Basil Mustafa
Andreas Steiner
Daniel Keysers
Alexander Kolesnikov
Lucas Beyer
VLM
197
561
0
15 Nov 2021
iBOT: Image BERT Pre-Training with Online Tokenizer
Jinghao Zhou
Chen Wei
Huiyu Wang
Wei Shen
Cihang Xie
Alan Yuille
Tao Kong
92
746
0
15 Nov 2021
Sentiment Analysis of Fashion Related Posts in Social Media
Yifei Yuan
W. Lam
75
8
0
15 Nov 2021
Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
Yizhen Zhang
Minkyu Choi
Kuan Han
Zhongming Liu
VLM
64
17
0
13 Nov 2021
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGS
ViT
207
356
0
11 Nov 2021
Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
Chuang Lin
Yi Jiang
Jianfei Cai
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
86
32
0
10 Nov 2021