Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.08530
Cited By
v1
v2
v3
v4 (latest)
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Re-assign community
ArXiv (abs)
PDF
HTML
Github (740★)
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,020 papers shown
Title
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
Cheng Shi
Sibei Yang
VLM
90
21
0
03 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
80
5
0
30 Aug 2023
CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation
Devaansh Gupta
Siddhant Kharbanda
Jiawei Zhou
Wanhua Li
Hanspeter Pfister
D. Wei
VLM
86
13
0
29 Aug 2023
CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
75
21
0
28 Aug 2023
A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER
Guanting Dong
Zechen Wang
Jinxu Zhao
Gang Zhao
Daichi Guo
...
Keqing He
Xuefeng Li
Liwen Wang
Xinyue Cui
Weiran Xu
84
22
0
28 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
196
945
0
24 Aug 2023
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
Zichao Dong
Weikun Zhang
Xufeng Huang
Hang Ji
Xin Zhan
Junbo Chen
VLM
47
4
0
24 Aug 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM
VLM
MoE
73
10
0
23 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
85
14
0
22 Aug 2023
ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts
Bilel Benjdira
Anis Koubaa
Anas M. Ali
LM&Ro
58
4
0
22 Aug 2023
MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation
Jinpeng Wang
Ziyun Zeng
Yunxiao Wang
Yuting Wang
Xingyu Lu
Tianxiang Li
Jun Yuan
Rui Zhang
Haitao Zheng
Shutao Xia
95
48
0
22 Aug 2023
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Haokun Chen
Yao Zhang
Denis Krompass
Jindong Gu
Volker Tresp
FedML
114
54
0
21 Aug 2023
BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction
Dong Wang
Kave Salamatian
Yunqing Xia
Weiwei Deng
Qi Zhang
56
14
0
17 Aug 2023
MM-GEF: Multi-modal representation meet collaborative filtering
Hao Wu
Alejandro Ariza-Casabona
Bartlomiej Twardowski
Tri Kurniawan Wijaya
49
2
0
14 Aug 2023
ViGT: Proposal-free Video Grounding with Learnable Token in Transformer
Kun Li
Dan Guo
Meng Wang
ViT
79
42
0
11 Aug 2023
Cross-Domain Product Representation Learning for Rich-Content E-Commerce
Xuehan Bai
Yan Li
Yong Cheng
Wenjie Yang
Quanming Chen
Han Li
61
4
0
10 Aug 2023
Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models
Zheng Ma
Mianzhi Pan
Wenhan Wu
Ka Leong Cheng
Jianbing Zhang
Shujian Huang
Jiajun Chen
VLM
CoGe
73
5
0
06 Aug 2023
Open-Set Domain Adaptation with Visual-Language Foundation Models
Qing Yu
Go Irie
Kiyoharu Aizawa
VLM
111
7
0
30 Jul 2023
Med-Flamingo: a Multimodal Medical Few-shot Learner
Michael Moor
Qian Huang
Shirley Wu
Michihiro Yasunaga
C. Zakka
Yashodhara Dalmia
E. Reis
Pranav Rajpurkar
J. Leskovec
LM&MA
MedIm
91
272
0
27 Jul 2023
MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities
Yongqian Li
Tingwei Lu
Hai-Tao Zheng
Tianyu Yu
Shulin Huang
Haitao Zheng
Rui Zhang
Jun Yuan
95
11
0
27 Jul 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
57
5
0
26 Jul 2023
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Jinxian Liu
Chen Ju
Chaofan Ma
Yanfeng Wang
Yu Wang
Ya Zhang
VOS
127
24
0
25 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
Yiyuan Zhang
Kaixiong Gong
Kaipeng Zhang
Hongsheng Li
Yu Qiao
Wanli Ouyang
Xiangyu Yue
105
150
0
20 Jul 2023
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
63
9
0
17 Jul 2023
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Kiet Van Nguyen
42
2
0
17 Jul 2023
Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making
Ruipu Luo
Jiwen Zhang
Zhongyu Wei
VLM
40
0
0
16 Jul 2023
Improving Zero-Shot Generalization for CLIP with Synthesized Prompts
Ziyi Wang
Jian Liang
Ran He
Nana Xu
Zilei Wang
Tien-Ping Tan
VLM
102
53
0
14 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
216
1
0
14 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
98
31
0
13 Jul 2023
One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
Michal Golovanevsky
Eva Schiller
Akira Nair
Ritambhara Singh
Carsten Eickhoff
72
3
0
11 Jul 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
MLLM
VLM
168
238
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
53
5
0
06 Jul 2023
Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning
K. Liang
Sihang Zhou
Yue Liu
Lingyuan Meng
Meng Liu
Xinwang Liu
105
16
0
06 Jul 2023
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Rui Sun
Zhecan Wang
Haoxuan You
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
CLIP
105
4
0
03 Jul 2023
MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling
Zhenyu Zhang
Wenhao Chai
Zhongyu Jiang
Tianbo Ye
Xiuming Zhang
Lei Li
Gaoang Wang
3DH
58
5
0
29 Jun 2023
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Qiong Wu
Shubin Huang
Yiyi Zhou
Pingyang Dai
Annan Shu
Guannan Jiang
Rongrong Ji
VLM
VPVLM
42
2
0
27 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
45
5
0
25 Jun 2023
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Yuxiang Cai
DiffM
VLM
146
66
0
20 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Hongsheng Li
MLLM
97
27
0
15 Jun 2023
Recurrent Action Transformer with Memory
A. Staroverov
A. Bessonov
Dmitry A. Yudin
A. Kovalev
Aleksandr I. Panov
OffRL
106
7
0
15 Jun 2023
Exploring the Application of Large-scale Pre-trained Models on Adverse Weather Removal
Zhentao Tan
Yue-bo Wu
Qiankun Liu
Qi Chu
Le Lu
Jieping Ye
Nenghai Yu
95
13
0
15 Jun 2023
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
Ziqiao Ma
Jiayi Pan
J. Chai
ObjD
VLM
72
9
0
14 Jun 2023
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Zeju Qiu
Wei-yu Liu
Haiwen Feng
Yuxuan Xue
Yao Feng
Zhen Liu
Dan Zhang
Adrian Weller
Bernhard Schölkopf
DiffM
126
158
0
12 Jun 2023
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Saidul Islam
Hanae Elmekki
Ahmed Elsebai
Jamal Bentahar
Najat Drawel
Gaith Rjoub
Witold Pedrycz
ViT
MedIm
89
210
0
11 Jun 2023
A blind spot for large language models: Supradiegetic linguistic information
Julia Witte Zimmerman
Denis Hudon
Kathryn Cramer
Jonathan St. Onge
M. Fudolig
Milo Z. Trujillo
C. Danforth
P. Dodds
21
3
0
11 Jun 2023
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Li Xu
Bo Liu
Ameer Hamza Khan
Lu Fan
Xiao-Ming Wu
LM&MA
65
9
0
10 Jun 2023
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Fuxiao Liu
Hao Tan
Chris Tensmeyer
CLIP
VLM
99
18
0
09 Jun 2023
Read, look and detect: Bounding box annotation from image-caption pairs
E. Sanchez
ObjD
64
0
0
09 Jun 2023
Previous
1
2
3
4
5
6
...
19
20
21
Next