Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.08530
Cited By
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Re-assign community
ArXiv
PDF
HTML
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,012 papers shown
Title
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
37
11
0
14 Apr 2023
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
Zhe-nan Lin
Xidong Peng
Peishan Cong
Ge Zheng
Yujin Sun
Yuenan Hou
Xinge Zhu
Sibei Yang
Yuexin Ma
VGen
92
4
0
12 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
21
4
0
11 Apr 2023
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training
Yunpeng Han
Lisai Zhang
Qingcai Chen
Zhijian Chen
Zhonghua Li
Jianxin Yang
Bo Zhao
AI4TS
VLM
28
11
0
11 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
21
1
0
10 Apr 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
43
51
0
06 Apr 2023
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens
Zhanyu Wang
Lingqiao Liu
Lei Wang
Luping Zhou
MedIm
21
74
0
05 Apr 2023
Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver
Xianpeng Liu
Ce Zheng
K. Cheng
Nan Xue
Guo-Jun Qi
Tianfu Wu
3DPC
34
6
0
03 Apr 2023
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun
Pengchuan Zhang
Peizhao Zhang
Hardik Shah
Kate Saenko
Xide Xia
VLM
25
20
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
24
44
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
62
8
0
30 Mar 2023
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Sifan Long
Zhen Zhao
Junkun Yuan
Zichang Tan
Jiangjiang Liu
Luping Zhou
Sheng-sheng Wang
Jingdong Wang
VLM
31
2
0
30 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
37
23
0
29 Mar 2023
Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation
Yuliang Cai
Jesse Thomason
Mohammad Rostami
VLM
CLL
19
11
0
25 Mar 2023
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
Junjie Ke
Keren Ye
Jiahui Yu
Yonghui Wu
P. Milanfar
Feng Yang
VLM
54
56
0
24 Mar 2023
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Ding Jiang
Mang Ye
35
139
0
22 Mar 2023
VideoXum: Cross-modal Visual and Textural Summarization of Videos
Jingyang Lin
Hang Hua
Ming Chen
Yikang Li
Jenhao Hsiao
C. Ho
Jiebo Luo
36
30
0
21 Mar 2023
Retrieving Multimodal Information for Augmented Generation: A Survey
Ruochen Zhao
Hailin Chen
Weishi Wang
Fangkai Jiao
Do Xuan Long
...
Bosheng Ding
Xiaobao Guo
Minzhi Li
Xingxuan Li
Chenyu You
31
82
0
20 Mar 2023
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Xuenan Xu
Zhiling Zhang
Zelin Zhou
Pingyue Zhang
Zeyu Xie
Mengyue Wu
Ke Zhu
CLIP
71
14
0
14 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
26
63
0
13 Mar 2023
DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation
Yueming Lyu
Tianwei Lin
Fu Li
Dongliang He
Jing Dong
Tien-Ping Tan
41
39
0
11 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
33
1
0
09 Mar 2023
TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions
He Zhu
Xihua Li
Xuemin Zhao
Yunbo Cao
Shan Yu
23
0
0
09 Mar 2023
Text-Visual Prompting for Efficient 2D Temporal Video Grounding
Yimeng Zhang
Xin Chen
Jinghan Jia
Sijia Liu
Ke Ding
23
25
0
09 Mar 2023
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
Yihan Cao
Siyu Li
Yixin Liu
Zhiling Yan
Yutong Dai
Philip S. Yu
Lichao Sun
38
508
0
07 Mar 2023
Prismer: A Vision-Language Model with Multi-Task Experts
Shikun Liu
Linxi Fan
Edward Johns
Zhiding Yu
Chaowei Xiao
Anima Anandkumar
VLM
MLLM
49
21
0
04 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
24
38
0
04 Mar 2023
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
Maria Lymperaiou
Giorgos Stamou
VLM
32
4
0
04 Mar 2023
Structure Pretraining and Prompt Tuning for Knowledge Graph Transfer
Wen Zhang
Yushan Zhu
Mingyang Chen
Yuxia Geng
Yufen Huang
Yajing Xu
Wenting Song
Hua-zeng Chen
51
26
0
03 Mar 2023
TextIR: A Simple Framework for Text-based Editable Image Restoration
Yun-Hao Bai
Cairong Wang
Shuzhao Xie
Chao Dong
Chun Yuan
Zhi Wang
DiffM
30
15
0
28 Feb 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
Contrastive Video Question Answering via Video Graph Transformer
Junbin Xiao
Pan Zhou
Angela Yao
Yicong Li
Richang Hong
Shuicheng Yan
Tat-Seng Chua
ViT
27
35
0
27 Feb 2023
Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model
Jaeyoung Huh
Sangjoon Park
Jeonghyeon Lee
Jong Chul Ye
LM&MA
22
9
0
27 Feb 2023
Understanding Social Media Cross-Modality Discourse in Linguistic Space
Chunpu Xu
Hanzhuo Tan
Jing Li
Piji Li
24
6
0
26 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
24
14
0
24 Feb 2023
Side Adapter Network for Open-Vocabulary Semantic Segmentation
Mengde Xu
Zheng-Wei Zhang
Fangyun Wei
Han Hu
Xiang Bai
VLM
26
248
0
23 Feb 2023
Entity-Level Text-Guided Image Manipulation
Yikai Wang
Jianan Wang
Guansong Lu
Hang Xu
Zhenguo Li
Wei Zhang
Yanwei Fu
VGen
34
3
0
22 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
48
205
0
20 Feb 2023
CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
Zhi Zhang
H. Yannakoudakis
Xiantong Zhen
Ekaterina Shutova
29
2
0
17 Feb 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Zhihong Chen
Shizhe Diao
Benyou Wang
Guanbin Li
Xiang Wan
MedIm
27
29
0
17 Feb 2023
Retrieval-augmented Image Captioning
R. Ramos
Desmond Elliott
Bruno Martins
VLM
32
29
0
16 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
33
7
0
16 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
42
40
0
14 Feb 2023
Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval
Ben Chen
Linbo Jin
Xinxin Wang
D. Gao
Wen Jiang
Wei Ning
22
3
0
10 Feb 2023
BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization
Weichao Zhao
Hezhen Hu
Wen-gang Zhou
Jiaxin Shi
Houqiang Li
SLR
23
30
0
10 Feb 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Fan Liu
Liqiang Nie
Mohan S. Kankanhalli
40
10
0
04 Feb 2023
Self-Supervised Relation Alignment for Scene Graph Generation
Bicheng Xu
Renjie Liao
Leonid Sigal
29
0
0
02 Feb 2023
Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
Hao Liu
Wilson Yan
Pieter Abbeel
34
25
0
02 Feb 2023
CLIPood: Generalizing CLIP to Out-of-Distributions
Yang Shu
Xingzhuo Guo
Jialong Wu
Ximei Wang
Jianmin Wang
Mingsheng Long
OODD
VLM
52
73
0
02 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
43
161
0
01 Feb 2023
Previous
1
2
3
...
6
7
8
...
19
20
21
Next