arXiv:1908.08530
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,012 papers shown
Title
MCA: Moment Channel Attention Networks
Yangbo Jiang
Zhiwei Jiang
Le Han
Zenan Huang
Nenggan Zheng
22
3
0
04 Mar 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
52
47
0
29 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
40
3
0
27 Feb 2024
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
Jiazhao Zhang
Kunyu Wang
Rongtao Xu
Gengze Zhou
Yicong Hong
Xiaomeng Fang
Qi Wu
Zhizheng Zhang
Wang He
LM&Ro
40
45
0
24 Feb 2024
Vision-Language Navigation with Embodied Intelligence: A Survey
Peng Gao
Peng Wang
Feng Gao
Fei-Yue Wang
Ruyue Yuan
LM&Ro
43
2
0
22 Feb 2024
SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Wonjoong Kim
S. Park
Yeonjun In
Seokwon Han
Chanyoung Park
LRM
ReLM
32
3
0
22 Feb 2024
WinoViz: Probing Visual Properties of Objects Under Different States
Woojeong Jin
Tejas Srinivasan
Jesse Thomason
Xiang Ren
33
1
0
21 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
37
4
0
14 Feb 2024
A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation
Zhengbo Wang
Jian Liang
Lijun Sheng
Ran He
Zilei Wang
Tieniu Tan
VLM
32
22
0
06 Feb 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
30
5
0
30 Jan 2024
Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks
Yuliang Cai
Mohammad Rostami
33
4
0
27 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
22
7
0
25 Jan 2024
LanDA: Language-Guided Multi-Source Domain Adaptation
Zhenbin Wang
Lei Zhang
Lituan Wang
Minjuan Zhu
35
10
0
25 Jan 2024
Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection
Fatma Shalabi
Hichem Felouat
H. Nguyen
Isao Echizen
MLLM
33
3
0
22 Jan 2024
Seeing the Unseen: Visual Common Sense for Semantic Placement
Ram Ramrakhya
Aniruddha Kembhavi
Dhruv Batra
Z. Kira
Kuo-Hao Zeng
Luca Weihs
VLM
41
5
0
15 Jan 2024
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang
Bohan Zhuang
Qi Wu
14
11
0
12 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
34
0
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
21
2
0
09 Jan 2024
Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
Sibo Wang
Jie Zhang
Zheng Yuan
Shiguang Shan
VLM
36
18
0
09 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
43
8
0
06 Jan 2024
Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training
Jiuming Qin
Che Liu
Sibo Cheng
Yike Guo
Rossella Arcucci
VLM
MedIm
25
5
0
02 Jan 2024
Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
Siyuan Li
Luyuan Zhang
Zedong Wang
Di Wu
Lirong Wu
...
Jun Xia
Cheng Tan
Yang Liu
Baigui Sun
Stan Z. Li
SSL
39
14
0
31 Dec 2023
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang
Jiajun Deng
Mingbo Jia
ObjD
42
7
0
23 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
48
29
0
19 Dec 2023
Mask Grounding for Referring Image Segmentation
Yong Xien Chng
Henry Zheng
Yizeng Han
Xuchong Qiu
Gao Huang
ISeg
ObjD
37
15
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
38
24
0
19 Dec 2023
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu
Xinyun Zhang
Peng Xu
Peiyu Liao
Xufeng Yao
Bei Yu
VLM
19
0
0
17 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
G. Loaiza-Ganem
M. Volkovs
51
3
0
15 Dec 2023
Text-Guided Face Recognition using Multi-Granularity Cross-Modal Contrastive Learning
Md Golam Moula Mehedi Hasan
S. Sami
Nasser M. Nasrabadi
26
4
0
14 Dec 2023
Domain Prompt Learning with Quaternion Networks
Qinglong Cao
Zhengqin Xu
Yuntian Chen
Chao Ma
Xiaokang Yang
VLM
39
10
0
12 Dec 2023
Multimodal Pretraining of Medical Time Series and Notes
Ryan N. King
Tianbao Yang
Bobak J. Mortazavi
25
12
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
13
4
0
11 Dec 2023
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models
Haicheng Liao
Huanming Shen
Zhenning Li
Chengyue Wang
Guofa Li
Yiming Bie
Chengzhong Xu
39
50
0
06 Dec 2023
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
Cong-Duy Nguyen
The-Anh Vu-Le
Thong Nguyen
Tho Quan
A. Luu
28
5
0
04 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM
LRM
MLLM
46
85
0
01 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
15
1
1
30 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
38
0
0
28 Nov 2023
Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
Yifei Chen
Dapeng Chen
Ruijin Liu
Sai Zhou
Wenyuan Xue
Wei Peng
33
6
0
27 Nov 2023
Open-Vocabulary Camouflaged Object Segmentation
Youwei Pang
Xiaoqi Zhao
Jiaming Zuo
Lihe Zhang
Huchuan Lu
VLM
ObjD
31
6
0
19 Nov 2023
Learning Mutually Informed Representations for Characters and Subwords
Yilin Wang
Xinyi Hu
Matthew R. Gormley
36
0
0
14 Nov 2023
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang
Rui Xu
Ye Guo
Peixiang Huang
Yiru Chen
Wenkui Ding
Zhongyuan Wang
Hong Zhou
LRM
21
5
0
09 Nov 2023
Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction
Zacharias Anastasakis
Dimitrios Mallis
Markos Diomataris
George Alexandridis
Stefanos D. Kollias
Vassilis Pitsikalis
29
2
0
08 Nov 2023
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
Jingru Yi
Burak Uzkent
Oana Ignat
Zili Li
Amanmeet Garg
Xiang Yu
Linda Liu
VLM
38
1
0
05 Nov 2023
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval
Junkyu Jang
Eugene Hwang
Sung-Hyuk Park
28
0
0
03 Nov 2023
MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Guangyue Xu
Parisa Kordjamshidi
Joyce Chai
24
2
0
02 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
43
36
0
01 Nov 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
54
2
0
30 Oct 2023
Generating Context-Aware Natural Answers for Questions in 3D Scenes
Mohammed Munzer Dwedari
Matthias Niessner
Dave Zhenyu Chen
27
1
0
30 Oct 2023
Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
Hejie Cui
Xinyu Fang
Zihan Zhang
Ran Xu
Xuan Kan
Xin Liu
Yue Yu
Manling Li
Yangqiu Song
Carl Yang
VLM
28
4
0
28 Oct 2023
RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments
Mengxue Qu
Yu-Huan Wu
Wu Liu
Xiaodan Liang
Jingkuan Song
Yao-Min Zhao
Yunchao Wei
19
15
0
26 Oct 2023