ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL, VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,119 papers shown
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
Yibo Cui
Liang Xie
Yakun Zhang
Meishan Zhang
Ye Yan
Erwei Yin
LM&Ro
87
17
0
24 Aug 2023
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
Zichao Dong
Weikun Zhang
Xufeng Huang
Hang Ji
Xin Zhan
Junbo Chen
VLM
47
4
0
24 Aug 2023
Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval
Yuan Yuan
Yangfan Zhan
Zhitong Xiong
VLM
87
47
0
24 Aug 2023
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No
Hualiang Wang
Yi Li
Huifeng Yao
Xuelong Li
VLM, OODD
135
108
0
23 Aug 2023
Cross-Modality Proposal-guided Feature Mining for Unregistered RGB-Thermal Pedestrian Detection
Chao Tian
Zikun Zhou
Yuqing Huang
Gaojun Li
Zhenyu He
78
9
0
23 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
85
14
0
22 Aug 2023
Unsupervised Prototype Adapter for Vision-Language Models
Yi Zhang
Ce Zhang
Xue-mei Hu
Z. He
VLM
79
4
0
22 Aug 2023
ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts
Bilel Benjdira
Anis Koubaa
Anas M. Ali
LM&Ro
62
4
0
22 Aug 2023
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Haokun Chen
Yao Zhang
Denis Krompass
Jindong Gu
Volker Tresp
FedML
119
55
0
21 Aug 2023
An Empirical Study of CLIP for Text-based Person Search
Min Cao
Yang Bai
Ziyin Zeng
Mang Ye
Min Zhang
VLM
124
48
0
19 Aug 2023
Whether you can locate or not? Interactive Referring Expression Generation
Fulong Ye
Yuxing Long
Fangxiang Feng
Xiaojie Wang
74
4
0
19 Aug 2023
EAVL: Explicitly Align Vision and Language for Referring Image Segmentation
Yimin Yan
Xingjian He
Wenxuan Wang
Sihan Chen
Qingbin Liu
ObjD, VLM
66
2
0
18 Aug 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
111
12
0
18 Aug 2023
Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition
Xuanyu Yi
Jiajun Deng
Qianru Sun
Xiansheng Hua
J. Lim
Hanwang Zhang
3DPC
63
14
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
76
1
0
18 Aug 2023
Diffusion Models for Image Restoration and Enhancement -- A Comprehensive Survey
Xin Li
Yulin Ren
Xin Jin
Cuiling Lan
Xingyu Wang
Wenjun Zeng
Xinchao Wang
Zhibo Chen
108
87
0
18 Aug 2023
Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
Rui Cao
Ming Shan Hee
Adriel Kuek
Wen-Haw Chong
Roy Ka-wei Lee
Jing Jiang
VLM, MLLM
56
43
0
16 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
77
23
0
15 Aug 2023
ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection
Jifeng Shen
Yifei Chen
Yue Liu
Xin Zuo
Heng Fan
Wankou Yang
ViT
67
118
0
15 Aug 2023
MM-GEF: Multi-modal representation meet collaborative filtering
Hao Wu
Alejandro Ariza-Casabona
Bartlomiej Twardowski
Tri Kurniawan Wijaya
49
2
0
14 Aug 2023
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Hongguang Zhu
Yunchao Wei
Xiaodan Liang
Chunjie Zhang
Yao-Min Zhao
VLM
72
30
0
14 Aug 2023
AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
Ziqi Zhou
Shengshan Hu
Minghui Li
Hangtao Zhang
Yechao Zhang
Hai Jin
AAML
131
75
0
14 Aug 2023
Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation
Md Golam Moula Mehedi Hasan
Nasser M. Nasrabadi
CVBM
47
2
0
13 Aug 2023
Cross-Domain Product Representation Learning for Rich-Content E-Commerce
Xuehan Bai
Yan Li
Yong Cheng
Wenjie Yang
Quanming Chen
Han Li
66
4
0
10 Aug 2023
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Ruitao Liu
Xiaohan Wang
Wenguan Wang
Yi Yang
125
57
0
09 Aug 2023
Pareto Invariant Representation Learning for Multimedia Recommendation
Shanshan Huang
Haoxuan Li
Qingsong Li
Chunyuan Zheng
Li Liu
CML
94
12
0
09 Aug 2023
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
Ziyu Zhu
Xiaojian Ma
Yixin Chen
Zhidong Deng
Siyuan Huang
Qing Li
LM&Ro
85
123
0
08 Aug 2023
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Haowei Wang
Jiji Tang
Jiayi Ji
Xiaoshuai Sun
Rongsheng Zhang
...
Minda Zhao
Lincheng Li
Zeng Zhao
Tangjie Lv
Rongrong Ji
3DV
102
16
0
06 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
182
721
0
04 Aug 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
58
2
0
31 Jul 2023
Open-Set Domain Adaptation with Visual-Language Foundation Models
Qing Yu
Go Irie
Kiyoharu Aizawa
VLM
111
7
0
30 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe, MLLM
126
46
0
30 Jul 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan
Noah Brown
Justice Carbajal
Yevgen Chebotar
Xi Chen
...
Ted Xiao
Peng Xu
Sichun Xu
Tianhe Yu
Brianna Zitkovich
LM&Ro, LRM
255
1,297
0
28 Jul 2023
Cross-Modal Concept Learning and Inference for Vision-Language Models
Yi Zhang
Ce Zhang
Yushun Tang
Z. He
VLM, MLLM, CLIP
81
16
0
28 Jul 2023
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
Yifei Xin
Yuexian Zou
121
9
0
28 Jul 2023
MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities
Yongqian Li
Tingwei Lu
Hai-Tao Zheng
Tianyu Yu
Shulin Huang
Haitao Zheng
Rui Zhang
Jun Yuan
97
11
0
27 Jul 2023
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Kun Yuan
V. Srivastav
Tong Yu
Joël L. Lavanchy
J. Marescaux
Pietro Mascagni
Nassir Navab
N. Padoy
201
23
0
27 Jul 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
62
5
0
26 Jul 2023
When Multi-Task Learning Meets Partial Supervision: A Computer Vision Review
Maxime Fontana
Michael W. Spratling
Miaojing Shi
87
7
0
25 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
148
128
0
25 Jul 2023
Spectrum-guided Multi-granularity Referring Video Object Segmentation
Bo Miao
Bennamoun
Yongsheng Gao
Ajmal Mian
VOS
97
41
0
25 Jul 2023
Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation
Haitian Zeng
Xiaohan Wang
Wenguan Wang
Yi Yang
80
7
0
25 Jul 2023
Towards a Visual-Language Foundation Model for Computational Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Ivy Liang
...
Andrew Zhang
L. Le
Georg Gerber
Anil V. Parwani
Faisal Mahmood
VLM, MedIm
110
46
0
24 Jul 2023
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports
Pujin Cheng
Li Lin
Junyan Lyu
Yijin Huang
Wenhan Luo
Xiaoying Tang
MedIm
142
51
0
24 Jul 2023
Learning Vision-and-Language Navigation from YouTube Videos
Kun-Li Channing Lin
Peihao Chen
Di Huang
Thomas H. Li
Mingkui Tan
Chuang Gan
LM&Ro
95
27
0
22 Jul 2023
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
Zunnan Xu
Zhihong Chen
Yong Zhang
Yibing Song
Xiang Wan
Guanbin Li
VLM
82
50
0
21 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
Findings of Factify 2: Multimodal Fake News Detection
S. Suryavardan
Shreyash Mishra
Megha Chakraborty
Parth Patwa
Anku Rani
...
Amitava Das
Amit P. Sheth
Manoj Kumar Chinnakotla
Asif Ekbal
Srijan Kumar
78
14
0
19 Jul 2023
Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving
Kaavya Rekanar
Ciarán Eising
Ganesh Sistu
Martin Hayes
25
3
0
18 Jul 2023
Multi-Modal Discussion Transformer: Integrating Text, Images and Graph Transformers to Detect Hate Speech on Social Media
Liam Hebert
Gaurav Sahu
Yuxuan Guo
Nanda Kishore Sreenivas
Lukasz Golab
Robin Cohen
65
11
0
18 Jul 2023