VL-BERT: Pre-training of Generic Visual-Linguistic Representations

22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM MLLM SSL

Papers citing "VL-BERT: Pre-training of Generic Visual-Linguistic Representations"

50 / 1,020 papers shown
LoFormer: Local Frequency Transformer for Image Deblurring
Xintian Mao
Jiansheng Wang
Xingran Xie
Qingli Li
Yan Wang
74
16
0
24 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
58
0
0
23 Jul 2024
Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval
Xiaowan Hu
Yiyi Chen
Yan Li
Minquan Wang
Haoqian Wang
Quan Chen
Han Li
Peng Jiang
AI4TS
78
0
0
23 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM MLLM
112
29
0
22 Jul 2024
Benchmark Granularity and Model Robustness for Image-Text Retrieval
Mariya Hendriksen
Shuo Zhang
R. Reinanda
Mohamed Yahya
Edgar Meij
Maarten de Rijke
75
0
0
21 Jul 2024
Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2
Chun Xu
En-Wei Sun
75
0
0
19 Jul 2024
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
Zaiqiao Meng
Hao Zhou
Yifang Chen
68
4
0
19 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
111
7
0
18 Jul 2024
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Gengze Zhou
Yicong Hong
Zun Wang
Xin Eric Wang
Qi Wu
LM&Ro
96
30
0
17 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
96
15
0
03 Jul 2024
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen
Fangyun Wei
Jinjing Zhao
Sizhe Song
Bohuai Wu
Zhuoxuan Peng
S.-H. Gary Chan
Hongyang R. Zhang
103
9
0
24 Jun 2024
Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models
Manas Jhalani
Annervaz K M
Pushpak Bhattacharyya
38
0
0
14 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM LRM
82
1
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng Zhang
Qi Dai
Chong Luo
Xin Geng
Baining Guo
VLM
84
1
0
13 Jun 2024
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
Samar Fares
Klea Ziu
Toluwani Aremu
Nikita Durasov
Martin Takáč
Pascal Fua
Karthik Nandakumar
Ivan Laptev
VLM AAML
99
5
0
13 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM CLIP
87
6
0
11 Jun 2024
Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering
Yujin Baek
Koanho Lee
Hyesu Lim
Jaeseok Kim
Junmo Park
Yu-Jung Heo
Du-Seong Chang
Jaegul Choo
38
3
0
04 Jun 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
161
0
0
04 Jun 2024
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Ding Jia
Jianyuan Guo
Kai Han
Han Wu
Chao Zhang
Chang Xu
Xinghao Chen
ViT
168
23
0
03 Jun 2024
Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs
Mohammed Saidul Islam
Raian Rahman
Ahmed Masry
Md Tahmid Rahman Laskar
Mir Tafseer Nayeem
Enamul Hoque
LRM ELM
63
13
0
01 Jun 2024
Towards Rationality in Language and Multimodal Agents: A Survey
Bowen Jiang
Yangxinyu Xie
Xiaomeng Wang
Yuan Yuan
Weijie J. Su
Camillo J. Taylor
Tanwi Mallick
LLMAG
85
6
0
01 Jun 2024
Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
Shenghuan Sun
Gregory M. Goldgof
Alexander Schubert
Zhiqing Sun
Thomas Hartvigsen
A. Butte
Ahmed Alaa
LM&MA
77
4
0
29 May 2024
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Xiaolin Chen
Liqiang Nie
Mohan S. Kankanhalli
LRM
52
8
0
27 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
333
54
0
23 May 2024
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin
M. F. Ahmed
Md. Mushtaq Shahriyar Rafee
VLM
119
3
0
19 May 2024
Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models
Canshi Wei
VLM
67
0
0
18 May 2024
Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions
Junzhang Liu
Zhecan Wang
Hammad A. Ayyubi
Haoxuan You
Chris Thomas
Rui Sun
Shih-Fu Chang
Kai-Wei Chang
155
0
0
18 May 2024
Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models
Zikun Zhou
Wentao Xiong
Li Zhou
Xin Li
Zhenyu He
Yaowei Wang
VOS VLM
59
0
0
17 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
93
10
0
14 May 2024
Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research
Qinglong Cao
Yuntian Chen
Lu Lu
Hao Sun
Zhenzhong Zeng
Xiaokang Yang
Dong-juan Zhang
VLM
58
1
0
14 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
77
2
0
12 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
66
0
0
09 May 2024
Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
Qing Yu
Mikihiro Tanaka
Kent Fujiwara
ViT
66
6
0
08 May 2024
POV Learning: Individual Alignment of Multimodal Models using Human Perception
Simon Werner
Katharina Christ
Laura Bernardy
Marion G. Müller
Achim Rettinger
33
0
0
07 May 2024
Language-Image Models with 3D Understanding
Jang Hyun Cho
Boris Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
72
19
0
06 May 2024
Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey
Dayou Du
Gu Gong
Xiaowen Chu
MQ
140
8
0
01 May 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
97
4
0
29 Apr 2024
Medical Vision-Language Pre-Training for Brain Abnormalities
Masoud Monajatipoor
Zi-Yi Dou
Aichi Chien
Nanyun Peng
Kai-Wei Chang
VLM
103
0
0
27 Apr 2024
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma
Zhihuan Yu
Yichao Ma
Guohui Li
LRM
73
1
0
22 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
117
2
0
22 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
170
5
0
16 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
112
5
0
11 Apr 2024
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation
Jiannan Ge
Lingxi Xie
Hongtao Xie
Pandeng Li
Xiaopeng Zhang
Yongdong Zhang
Qi Tian
VLM
99
3
0
08 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
114
1
0
05 Apr 2024
Koala: Key frame-conditioned long video-LLM
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
111
41
0
05 Apr 2024
DeViDe: Faceted medical knowledge for improved medical vision-language pre-training
Haozhe Luo
Ziyu Zhou
Corentin Royer
Anjany Sekuboyina
Bjoern Menze
VLM ViT MedIm
101
7
0
04 Apr 2024
Is CLIP the main roadblock for fine-grained open-world perception?
Lorenzo Bianchi
F. Carrara
Nicola Messina
Fabrizio Falchi
VLM
81
4
0
04 Apr 2024
3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization
Seung-bum Chung
Joohyun Park
Hyewon Kan
Hyeongyeop Kang
CLIP
77
1
0
03 Apr 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
50
5
0
01 Apr 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Peng Gao
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Hongsheng Li
VLM
184
47
0
29 Mar 2024