Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.08530
Cited By
v1
v2
v3
v4 (latest)
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Re-assign community
ArXiv (abs)
PDF
HTML
Github (740★)
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,020 papers shown
Title
YORO -- Lightweight End to End Visual Grounding
Chih-Hui Ho
Srikar Appalaraju
Bhavan A. Jasani
R. Manmatha
Nuno Vasconcelos
ObjD
60
22
0
15 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
79
6
0
14 Nov 2022
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense
Zhecan Wang
Haoxuan You
Yicheng He
Wenhao Li
Kai-Wei Chang
Shih-Fu Chang
116
5
0
10 Nov 2022
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna
Kees van Deemter
Albert Gatt
CoGe
54
4
0
09 Nov 2022
Masked Vision-Language Transformers for Scene Text Recognition
Jie Wu
Ying Peng
Shenmin Zhang
Weigang Qi
Jian Zhang
71
3
0
09 Nov 2022
Logographic Information Aids Learning Better Representations for Natural Language Inference
Zijian Jin
Duygu Ataman
58
1
0
03 Nov 2022
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Khyathi Chandu
A. Geramifard
70
3
0
30 Oct 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
108
13
0
28 Oct 2022
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
Chaofan Ma
Yu-Hao Yang
Yanfeng Wang
Ya Zhang
Weidi Xie
VLM
74
48
0
27 Oct 2022
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
90
25
0
27 Oct 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Suvir Mirchandani
Licheng Yu
Mengjiao MJ Wang
Animesh Sinha
Wen-Jun Jiang
Tao Xiang
Ning Zhang
81
16
0
26 Oct 2022
End-to-End Multimodal Representation Learning for Video Dialog
Huda AlAmri
Anthony Bilic
Michael Hu
Apoorva Beedu
Irfan Essa
84
7
0
26 Oct 2022
From colouring-in to pointillism: revisiting semantic segmentation supervision
Rodrigo Benenson
V. Ferrari
VLM
74
21
0
25 Oct 2022
VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge
Sahithya Ravi
Aditya Chinchure
Leonid Sigal
Renjie Liao
Vered Shwartz
71
29
0
24 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
45
6
0
24 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
57
6
0
24 Oct 2022
Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination
Yue Yang
Wenlin Yao
Hongming Zhang
Xiaoyang Wang
Dong Yu
Jianshu Chen
VLM
99
22
0
21 Oct 2022
SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation
Zekun Li
Jina Kim
Yao-Yi Chiang
Muhao Chen
133
31
0
21 Oct 2022
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Mitja Nikolaus
Emmanuelle Salin
Stéphane Ayache
Abdellah Fourtassi
Benoit Favre
76
14
0
21 Oct 2022
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Yuechen Wang
Wen-gang Zhou
Houqiang Li
AI4TS
63
13
0
21 Oct 2022
VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors
Yifeng Zhu
Abhishek Joshi
Peter Stone
Yuke Zhu
LM&Ro
97
134
0
20 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
118
8
0
19 Oct 2022
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation
Pengfei Li
Beiwen Tian
Yongliang Shi
Xiaoxue Chen
Hao Zhao
Guyue Zhou
Ya Zhang
118
22
0
19 Oct 2022
Non-Contrastive Learning Meets Language-Image Pre-Training
Jinghao Zhou
Li Dong
Zhe Gan
Lijuan Wang
Furu Wei
VLM
CLIP
75
26
0
17 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLM
CLIP
79
54
0
17 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
82
44
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Jane Polak Scowcroft
Pascale Fung
MLLM
VLM
88
67
0
14 Oct 2022
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
Mojtaba Valipour
Mehdi Rezagholizadeh
I. Kobyzev
A. Ghodsi
160
184
0
14 Oct 2022
That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data
Denis Jered McInerney
Geoffrey S. Young
Jan-Willem van de Meent
Byron C. Wallace
45
0
0
12 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
52
1
0
12 Oct 2022
Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers
Moritz Einfalt
K. Ludwig
Rainer Lienhart
101
47
0
12 Oct 2022
Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning
Fuying Wang
Yuyin Zhou
Shujun Wang
V. Vardhanabhuti
Lequan Yu
117
149
0
12 Oct 2022
Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features
Gokul Karthik Kumar
Karthik Nandakumar
VLM
CLIP
86
66
0
12 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
82
33
0
11 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
104
4
0
10 Oct 2022
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Zijia Zhao
Longteng Guo
Xingjian He
Shuai Shao
Zehuan Yuan
Jing Liu
105
9
0
09 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
82
22
0
09 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
95
2
0
08 Oct 2022
Video Referring Expression Comprehension via Transformer with Content-aware Query
Ji Jiang
Meng Cao
Tengtao Song
Yuexian Zou
88
5
0
06 Oct 2022
Domain-Unified Prompt Representations for Source-Free Domain Generalization
Hongjing Niu
Hanting Li
Feng Zhao
Bin Li
VLM
117
19
0
29 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
105
28
0
28 Sep 2022
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
83
34
0
27 Sep 2022
Unsupervised Hashing with Semantic Concept Mining
Rong-Cheng Tu
Xian-Ling Mao
Kevin Qinghong Lin
Chengfei Cai
Weize Qin
Hongfa Wang
Wei Wei
Heyan Huang
125
12
0
23 Sep 2022
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
Rogerio Bonatti
Sai H. Vemprala
Shuang Ma
Felipe Vieira Frujeri
Shuhang Chen
Ashish Kapoor
94
23
0
22 Sep 2022
Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Yiren Jian
Chongyang Gao
Soroush Vosoughi
SSL
95
15
0
20 Sep 2022
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Lovisa Hagström
Richard Johansson
VLM
64
4
0
19 Sep 2022
LAVIS: A Library for Language-Vision Intelligence
Dongxu Li
Junnan Li
Hung Le
Guangsen Wang
Silvio Savarese
Guosheng Lin
VLM
192
56
0
15 Sep 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
139
153
0
15 Sep 2022
Distribution Aware Metrics for Conditional Natural Language Generation
David M. Chan
Yiming Ni
David A. Ross
Sudheendra Vijayanarasimhan
Austin Myers
John F. Canny
77
4
0
15 Sep 2022
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Zhihong Chen
Guanbin Li
Xiang Wan
178
73
0
15 Sep 2022
Previous
1
2
3
...
8
9
10
...
19
20
21
Next