ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.08530
  4. Cited By
VL-BERT: Pre-training of Generic Visual-Linguistic Representations

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
    VLM
    MLLM
    SSL
ArXivPDFHTML

Papers citing "VL-BERT: Pre-training of Generic Visual-Linguistic Representations"

50 / 1,012 papers shown
Title
Multimodality Representation Learning: A Survey on Evolution,
  Pretraining and Its Applications
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
42
26
0
01 Feb 2023
Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework
  for Visual Commonsense Reasoning
Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning
Jian Zhu
Hanli Wang
Miaojing Shi
LRM
27
4
0
30 Jan 2023
Lexi: Self-Supervised Learning of the UI Language
Lexi: Self-Supervised Learning of the UI Language
Pratyay Banerjee
Shweti Mahajan
Kushal Arora
Chitta Baral
Oriana Riva
36
17
0
23 Jan 2023
MultiNet with Transformers: A Model for Cancer Diagnosis Using Images
MultiNet with Transformers: A Model for Cancer Diagnosis Using Images
H. Barzekar
Yash J. Patel
L. Tong
Zeyun Yu
MedIm
32
6
0
21 Jan 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Floris Weers
Vaishaal Shankar
Angelos Katharopoulos
Yinfei Yang
Tom Gunter
CLIP
23
4
0
19 Jan 2023
Joint Representation Learning for Text and 3D Point Cloud
Joint Representation Learning for Text and 3D Point Cloud
Rui Huang
Xuran Pan
Henry Zheng
Haojun Jiang
Zhifeng Xie
S. Song
Gao Huang
36
19
0
18 Jan 2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question
  Answering
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
Paul Lerner
O. Ferret
C. Guinaudeau
24
9
0
11 Jan 2023
Universal Multimodal Representation for Language Understanding
Universal Multimodal Representation for Language Understanding
ZhuoSheng Zhang
Kehai Chen
Rui Wang
Masao Utiyama
Eiichiro Sumita
Z. Li
Hai Zhao
SSL
19
21
0
09 Jan 2023
Transferring Pre-trained Multimodal Representations with Cross-modal
  Similarity Matching
Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
Byoungjip Kim
Sun Choi
Dasol Hwang
Moontae Lee
Honglak Lee
33
10
0
07 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with
  Pre-Training Methods
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
34
15
0
05 Jan 2023
On Transforming Reinforcement Learning by Transformer: The Development
  Trajectory
On Transforming Reinforcement Learning by Transformer: The Development Trajectory
Shengchao Hu
Li Shen
Ya Zhang
Yixin Chen
Dacheng Tao
OffRL
27
25
0
29 Dec 2022
Prototype-guided Cross-task Knowledge Distillation for Large-scale
  Models
Prototype-guided Cross-task Knowledge Distillation for Large-scale Models
Deng Li
Aming Wu
Yahong Han
Qingwen Tian
VLM
33
2
0
26 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Generalized Decoding for Pixel, Image, and Language
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
21
241
0
21 Dec 2022
Tackling Ambiguity with Images: Improved Multimodal Machine Translation
  and Contrastive Evaluation
Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation
Matthieu Futeral
Cordelia Schmid
Ivan Laptev
Benoît Sagot
Rachel Bawden
31
26
0
20 Dec 2022
Fully and Weakly Supervised Referring Expression Segmentation with
  End-to-End Learning
Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning
Hui Li
Mingjie Sun
Jimin Xiao
Eng Gee Lim
Yao-Min Zhao
29
20
0
17 Dec 2022
Visually-augmented pretrained language models for NLP tasks without
  images
Visually-augmented pretrained language models for NLP tasks without images
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Qinyu Zhang
Ji-Rong Wen
VLM
21
10
0
15 Dec 2022
Find Someone Who: Visual Commonsense Understanding in Human-Centric
  Grounding
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding
Haoxuan You
Rui Sun
Zhecan Wang
Kai-Wei Chang
Shih-Fu Chang
16
5
0
14 Dec 2022
A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic,
  and Multimodal
A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multimodal
K. Liang
Lingyuan Meng
Meng Liu
Yue Liu
Wenxuan Tu
Siwei Wang
Sihang Zhou
Xinwang Liu
Fu Sun
LRM
33
110
0
12 Dec 2022
Learning Video Representations from Large Language Models
Learning Video Representations from Large Language Models
Yue Zhao
Ishan Misra
Philipp Krahenbuhl
Rohit Girdhar
VLM
AI4TS
28
167
0
08 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food
  Retrieval
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP
CoGe
37
8
0
08 Dec 2022
DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal
  Dialogue Dataset
DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset
Young-Jun Lee
ByungSoo Ko
Han-Gyu Kim
Jonghwan Hyeon
Ho-Jin Choi
24
7
0
08 Dec 2022
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
Ziqi Zhou
Bowen Zhang
Yinjie Lei
Lingqiao Liu
Yifan Liu
VLM
38
168
0
07 Dec 2022
Unifying Vision, Text, and Layout for Universal Document Processing
Unifying Vision, Text, and Layout for Universal Document Processing
Zineng Tang
Ziyi Yang
Guoxin Wang
Yuwei Fang
Yang Liu
Chenguang Zhu
Michael Zeng
Chao-Yue Zhang
Joey Tianyi Zhou
VLM
32
107
0
05 Dec 2022
Images Speak in Images: A Generalist Painter for In-Context Visual
  Learning
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
Xinlong Wang
Wen Wang
Yue Cao
Chunhua Shen
Tiejun Huang
VLM
MLLM
66
245
0
05 Dec 2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu
Biaolong Chen
Yue Liao
Shuwen Xiao
Wenyu Sun
Xiaobo Li
Yousong Zhu
Jinqiao Wang
Si Liu
CLIP
35
11
0
02 Dec 2022
Layout-aware Dreamer for Embodied Referring Expression Grounding
Layout-aware Dreamer for Embodied Referring Expression Grounding
Mingxiao Li
Zehao Wang
Tinne Tuytelaars
Marie-Francine Moens
LM&Ro
17
6
0
30 Nov 2022
PiggyBack: Pretrained Visual Question Answering Environment for Backing
  up Non-deep Learning Professionals
PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals
Zhihao Zhang
Siwen Luo
Junyi Chen
Sijia Lai
Siqu Long
Hyunsuk Chung
S. Han
27
1
0
29 Nov 2022
Survey on Self-Supervised Multimodal Representation Learning and
  Foundation Models
Survey on Self-Supervised Multimodal Representation Learning and Foundation Models
Sushil Thapa
AI4TS
SSL
20
1
0
29 Nov 2022
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and
  Grounding
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Siyi Liu
Yaoyuan Liang
Feng Li
Shijia Huang
Hao Zhang
Hang Su
Jun Zhu
Lei Zhang
ObjD
50
26
0
28 Nov 2022
A Light Touch Approach to Teaching Transformers Multi-view Geometry
A Light Touch Approach to Teaching Transformers Multi-view Geometry
Yash Bhalgat
Joao F. Henriques
Andrew Zisserman
ViT
27
6
0
28 Nov 2022
Who are you referring to? Coreference resolution in image narrations
Who are you referring to? Coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
27
3
0
26 Nov 2022
Multitask Vision-Language Prompt Tuning
Multitask Vision-Language Prompt Tuning
Sheng Shen
Shijia Yang
Tianjun Zhang
Bohan Zhai
Joseph E. Gonzalez
Kurt Keutzer
Trevor Darrell
VLM
VPVLM
19
49
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng-Wei Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM
VLM
31
17
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
35
3
0
21 Nov 2022
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive
  Pre-Training
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
Ling Yang
Zhilin Huang
Yang Song
Shenda Hong
Ge Li
Wentao Zhang
Bin Cui
Guohao Li
Ming-Hsuan Yang
33
52
0
21 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
16
10
0
20 Nov 2022
A survey on knowledge-enhanced multimodal learning
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
41
14
0
19 Nov 2022
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual
  Question Answering
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
Yao Zhang
Haokun Chen
A. Frikha
Yezi Yang
Denis Krompass
Gengyuan Zhang
Jindong Gu
Volker Tresp
VLM
LRM
16
7
0
19 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual
  Information
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
42
41
0
17 Nov 2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
53
101
0
15 Nov 2022
YORO -- Lightweight End to End Visual Grounding
YORO -- Lightweight End to End Visual Grounding
Chih-Hui Ho
Srikar Appalaraju
Bhavan A. Jasani
R. Manmatha
Nuno Vasconcelos
ObjD
21
21
0
15 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Grafting Pre-trained Models for Multimodal Headline Generation
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
43
5
0
14 Nov 2022
Understanding ME? Multimodal Evaluation for Fine-grained Visual
  Commonsense
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense
Zhecan Wang
Haoxuan You
Yicheng He
Wenhao Li
Kai-Wei Chang
Shih-Fu Chang
23
5
0
10 Nov 2022
Understanding Cross-modal Interactions in V&L Models that Generate Scene
  Descriptions
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna
Kees van Deemter
Albert Gatt
CoGe
16
4
0
09 Nov 2022
Masked Vision-Language Transformers for Scene Text Recognition
Masked Vision-Language Transformers for Scene Text Recognition
Jie Wu
Ying Peng
Shenmin Zhang
Weigang Qi
Jian Zhang
45
3
0
09 Nov 2022
Logographic Information Aids Learning Better Representations for Natural
  Language Inference
Logographic Information Aids Learning Better Representations for Natural Language Inference
Zijian Jin
Duygu Ataman
28
0
0
03 Nov 2022
Multilingual Multimodality: A Taxonomical Survey of Datasets,
  Techniques, Challenges and Opportunities
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Khyathi Raghavi Chandu
A. Geramifard
40
3
0
30 Oct 2022
DiMBERT: Learning Vision-Language Grounded Representations with
  Disentangled Multimodal-Attention
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
75
12
0
28 Oct 2022
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
Chaofan Ma
Yu-Hao Yang
Yanfeng Wang
Ya Zhang
Weidi Xie
VLM
31
48
0
27 Oct 2022
Masked Vision-Language Transformer in Fashion
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
21
25
0
27 Oct 2022
Previous
123...789...192021
Next