ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.08530
  4. Cited By
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
v1v2v3v4 (latest)

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
    VLMMLLMSSL
ArXiv (abs)PDFHTMLGithub (740★)

Papers citing "VL-BERT: Pre-training of Generic Visual-Linguistic Representations"

50 / 1,020 papers shown
Title
Improving and Diagnosing Knowledge-Based Visual Question Answering via
  Entity Enhanced Knowledge Injection
Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection
Diego Garcia-Olano
Yasumasa Onoe
Joydeep Ghosh
69
18
0
13 Dec 2021
Unified Multimodal Pre-training and Prompt-based Tuning for
  Vision-Language Understanding and Generation
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Tianyi Liu
Zuxuan Wu
Wenhan Xiong
Jingjing Chen
Yu-Gang Jiang
VLMMLLM
88
10
0
10 Dec 2021
MAGMA -- Multimodal Augmentation of Generative Models through
  Adapter-based Finetuning
MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning
C. Eichenberg
Sid Black
Samuel Weinbach
Letitia Parcalabescu
Anette Frank
MLLMVLM
72
101
0
09 Dec 2021
CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions
CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions
Rameen Abdal
Peihao Zhu
John C. Femiani
Niloy J. Mitra
Peter Wonka
CLIP
79
107
0
09 Dec 2021
HairCLIP: Design Your Hair by Text and Reference Image
HairCLIP: Design Your Hair by Text and Reference Image
Tianyi Wei
Dongdong Chen
Wenbo Zhou
Jing Liao
Zhentao Tan
Lu Yuan
Weiming Zhang
Nenghai Yu
CLIP
71
111
0
09 Dec 2021
Multimodal Pre-Training Model for Sequence-based Prediction of
  Protein-Protein Interaction
Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction
Yang Xue
Zijing Liu
Xiaomin Fang
Fan Wang
111
8
0
09 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIPVLM
151
719
0
08 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
60
6
0
08 Dec 2021
Grounded Language-Image Pre-training
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjDVLM
178
1,071
0
07 Dec 2021
CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
Huidong Liu
Shaoyuan Xu
Jinmiao Fu
Yang Liu
Ning Xie
Chien Wang
Bryan Wang
Yi Sun
CLIPVLM
72
29
0
07 Dec 2021
From Coarse to Fine-grained Concept based Discrimination for Phrase
  Detection
From Coarse to Fine-grained Concept based Discrimination for Phrase Detection
Maan Qraitem
Bryan A. Plummer
ObjD
56
0
0
06 Dec 2021
General Facial Representation Learning in a Visual-Linguistic Manner
General Facial Representation Learning in a Visual-Linguistic Manner
Yinglin Zheng
Hao Yang
Ting Zhang
Jianmin Bao
Dongdong Chen
Yangyu Huang
Lu Yuan
Dong Chen
Ming Zeng
Fang Wen
CVBM
207
176
0
06 Dec 2021
Joint Learning of Localized Representations from Medical Images and
  Reports
Joint Learning of Localized Representations from Medical Images and Reports
Philipp Muller
Georgios Kaissis
Cong Zou
Daniel Munich
210
87
0
06 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception
  for Zero-shot and Few-shot Tasks
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
124
133
0
02 Dec 2021
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Yongming Rao
Wenliang Zhao
Guangyi Chen
Yansong Tang
Zheng Zhu
Guan Huang
Jie Zhou
Jiwen Lu
VLMCLIP
226
583
0
02 Dec 2021
Video-Text Pre-training with Learned Regions
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
96
24
0
02 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViTVLM
70
82
0
01 Dec 2021
NER-BERT: A Pre-trained Model for Low-Resource Entity Tagging
NER-BERT: A Pre-trained Model for Low-Resource Entity Tagging
Zihan Liu
Feijun Jiang
Yuxiang Hu
Chen Shi
Pascale Fung
126
38
0
01 Dec 2021
AssistSR: Task-oriented Video Segment Retrieval for Personal AI
  Assistant
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
Stan Weixian Lei
Difei Gao
Yuxuan Wang
Dongxing Mao
Zihan Liang
L. Ran
Mike Zheng Shou
69
8
0
30 Nov 2021
LAFITE: Towards Language-Free Training for Text-to-Image Generation
LAFITE: Towards Language-Free Training for Text-to-Image Generation
Yufan Zhou
Ruiyi Zhang
Changyou Chen
Chunyuan Li
Chris Tensmeyer
Tong Yu
Jiuxiang Gu
Jinhui Xu
Tong Sun
VLM
86
168
0
27 Nov 2021
VL-LTR: Learning Class-wise Visual-Linguistic Representation for
  Long-Tailed Visual Recognition
VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
Changyao Tian
Wenhai Wang
Xizhou Zhu
Jifeng Dai
Yu Qiao
VLM
105
73
0
26 Nov 2021
Neural Collaborative Graph Machines for Table Structure Recognition
Neural Collaborative Graph Machines for Table Structure Recognition
Hao Liu
Xin Li
Bin Liu
Deqiang Jiang
Yinsong Liu
Bo Ren
LMTD
113
33
0
26 Nov 2021
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image
  Manipulation Empowered by Pre-Trained Vision-Language Model
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Zipeng Xu
Tianwei Lin
Hao Tang
Fu Li
Dongliang He
N. Sebe
Radu Timofte
Luc Van Gool
Errui Ding
EGVM
95
43
0
26 Nov 2021
Layered Controllable Video Generation
Layered Controllable Video Generation
Jiahui Huang
Yuhe Jin
K. M. Yi
Leonid Sigal
VGen
68
11
0
24 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
  Modeling
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
154
221
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLMVLM
170
249
0
24 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language
  Modeling
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
146
117
0
23 Nov 2021
Building Goal-Oriented Dialogue Systems with Situated Visual Context
Building Goal-Oriented Dialogue Systems with Situated Visual Context
Sanchit Agarwal
Jan Jezabek
Arijit Biswas
Emre Barut
Shuyang Gao
Tagyoung Chung
50
1
0
22 Nov 2021
RedCaps: web-curated image-text data created by the people, for the
  people
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
137
169
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
Class-agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming-Hsuan Yang
141
97
0
22 Nov 2021
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating
  Visio-Linguistic Reasoning
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning
Keng Ji Chow
Samson Tan
MingSung Kan
LRM
63
4
0
21 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
97
4
0
19 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
78
57
0
19 Nov 2021
Achieving Human Parity on Visual Question Answering
Achieving Human Parity on Visual Question Answering
Ming Yan
Haiyang Xu
Chenliang Li
Junfeng Tian
Bin Bi
...
Ji Zhang
Songfang Huang
Fei Huang
Luo Si
Rong Jin
61
13
0
17 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual
  Concepts
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Yan Zeng
Xinsong Zhang
Hang Li
VLMCLIP
95
308
0
16 Nov 2021
LiT: Zero-Shot Transfer with Locked-image text Tuning
LiT: Zero-Shot Transfer with Locked-image text Tuning
Xiaohua Zhai
Tianlin Li
Basil Mustafa
Andreas Steiner
Daniel Keysers
Alexander Kolesnikov
Lucas Beyer
VLM
171
561
0
15 Nov 2021
iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT: Image BERT Pre-Training with Online Tokenizer
Jinghao Zhou
Chen Wei
Huiyu Wang
Wei Shen
Cihang Xie
Alan Yuille
Tao Kong
92
746
0
15 Nov 2021
Attention Mechanisms in Computer Vision: A Survey
Attention Mechanisms in Computer Vision: A Survey
Meng-Hao Guo
Tianhan Xu
Jiangjiang Liu
Zheng-Ning Liu
Peng-Tao Jiang
Tai-Jiang Mu
Song-Hai Zhang
Ralph Robert Martin
Ming-Ming Cheng
Shimin Hu
142
1,741
0
15 Nov 2021
Explainable Semantic Space by Grounding Language to Vision with
  Cross-Modal Contrastive Learning
Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
Yizhen Zhang
Minkyu Choi
Kuan Han
Zhongming Liu
VLM
64
17
0
13 Nov 2021
A Survey of Visual Transformers
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGSViT
191
356
0
11 Nov 2021
NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Shasta Ihorn
Y. Siu
Aditya Bodi
Lothar D Narins
Jose M. Castanon
Yash Kant
Abhishek Das
Ilmi Yoon
Pooyan Fazli
44
3
0
07 Nov 2021
An Empirical Study of Training End-to-End Vision-and-Language
  Transformers
An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou
Yichong Xu
Zhe Gan
Jianfeng Wang
Shuohang Wang
...
Pengchuan Zhang
Lu Yuan
Nanyun Peng
Zicheng Liu
Michael Zeng
VLM
104
381
0
03 Nov 2021
VLMo: Unified Vision-Language Pre-Training with
  Mixture-of-Modality-Experts
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLMMLLMMoE
104
560
0
03 Nov 2021
Towards artificial general intelligence via a multimodal foundation
  model
Towards artificial general intelligence via a multimodal foundation model
Nanyi Fei
Zhiwu Lu
Yizhao Gao
Guoxing Yang
Yuqi Huo
...
Ruihua Song
Xin Gao
Tao Xiang
Haoran Sun
Jiling Wen
AI4CELRM
90
230
0
27 Oct 2021
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal
  Emotion Recognition
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
Jinming Zhao
Ruichen Li
Qin Jin
Xinchao Wang
Haizhou Li
49
25
0
27 Oct 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning
  for Visual Sound Separation
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
75
8
0
26 Oct 2021
Text-Based Person Search with Limited Data
Text-Based Person Search with Limited Data
Xiaoping Han
Sen He
Li Zhang
Tao Xiang
94
91
0
20 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal
  Retrieval
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViTVLM
98
21
0
20 Oct 2021
TransFusion: Cross-view Fusion with Transformer for 3D Human Pose
  Estimation
TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation
Haoyu Ma
Liangjian Chen
Deying Kong
Zhe Wang
Xingwei Liu
Hao Tang
Xiangyi Yan
Yusheng Xie
Shi-yao Lin
Xiaohui Xie
ViT
124
61
0
18 Oct 2021
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal
  Instructional Manuals
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu
Alexander Spangher
Pegah Alipoormolabashi
Marjorie Freedman
R. Weischedel
Nanyun Peng
78
23
0
16 Oct 2021
Previous
123...131415...192021
Next