ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,119 papers shown
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining
Qin Chao
Eunsoo Kim
Boyang Albert Li
57
1
0
20 Apr 2023
Is Cross-modal Information Retrieval Possible without Training?
Hyunjin Choi
HyunJae Lee
Seongho Joe
Youngjune Gwon
49
1
0
20 Apr 2023
Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification
Suncheng Xiang
Jingsheng Gao
Mengyuan Guan
Jiacheng Ruan
Chengfeng Zhou
Ting Liu
Xiaobo Li
Yuzhuo Fu
88
5
0
19 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
63
13
0
18 Apr 2023
Learning Situation Hyper-Graphs for Video Question Answering
Aisha Urooj Khan
Hilde Kuehne
Bo Wu
Kim Chheu
Walid Bousselham
Chuang Gan
N. Lobo
M. Shah
92
16
0
18 Apr 2023
Grounding Classical Task Planners via Vision-Language Models
Xiaohan Zhang
Yan Ding
S. Amiri
Hao Yang
Andy Kaminski
Chad Esselink
Shiqi Zhang
83
17
0
17 Apr 2023
Pretrained Language Models as Visual Planners for Human Assistance
Dhruvesh Patel
H. Eghbalzadeh
Nitin Kamra
Michael L. Iuzzolino
Unnat Jain
Ruta Desai
LM&Ro
87
25
0
17 Apr 2023
Towards Robust Prompts on Vision-Language Models
Jindong Gu
Ahmad Beirami
Xuezhi Wang
Alex Beutel
Philip Torr
Yao Qin
VLM, VPVLM
86
8
0
17 Apr 2023
Progressive Visual Prompt Learning with Contrastive Feature Re-formation
C. Xu
Yuhan Zhu
Haocheng Shen
Fengyuan Shi
Boheng Chen
Yixuan Liao
Xiaoxin Chen
Limin Wang
VLM
107
22
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
141
112
0
17 Apr 2023
CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval
Yang Yang
Zhongtian Fu
Xiangyu Wu
Wenjie Li
VLM
70
1
0
15 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
77
11
0
14 Apr 2023
HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition
Soumya Dutta
Sriram Ganapathy
117
18
0
14 Apr 2023
Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction
Guillaume Jaume
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Paul Pu Liang
Faisal Mahmood
105
51
0
13 Apr 2023
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
111
71
0
13 Apr 2023
Road Network Representation Learning: A Dual Graph based Approach
Li Zhang
Cheng Long
AI4TS, GNN
96
13
0
13 Apr 2023
Efficient Multimodal Fusion via Interactive Prompting
Yaowei Li
Ruijie Quan
Linchao Zhu
Yezhou Yang
84
45
0
13 Apr 2023
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
Zhe Lin
Xidong Peng
Peishan Cong
Ge Zheng
Yujin Sun
Yuenan Hou
Xinge Zhu
Sibei Yang
Yuexin Ma
VGen
148
5
0
12 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
69
4
0
11 Apr 2023
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training
Yunpeng Han
Lisai Zhang
Qingcai Chen
Zhijian Chen
Zhonghua Li
Jianxin Yang
Bo Zhao
AI4TS, VLM
89
13
0
11 Apr 2023
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
Jialu Li
Joey Tianyi Zhou
102
37
0
11 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP, VLM
65
1
0
10 Apr 2023
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
Ran Gong
Jiangyong Huang
Yizhou Zhao
Haoran Geng
Xiaofeng Gao
...
Ziheng Zhou
D. Terzopoulos
Song-Chun Zhu
Baoxiong Jia
Siyuan Huang
LM&Ro
111
50
0
09 Apr 2023
InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems
Fang Wu
Huiling Qin
Siyuan Li
Stan Z. Li
Xianyuan Zhan
Jinbo Xu
80
5
0
08 Apr 2023
Probing Conceptual Understanding of Large Visual-Language Models
Madeline Chantry Schiappa
Raiyaan Abdullah
Shehreen Azad
Jared Claypoole
Michael Cogswell
Ajay Divakaran
Yogesh S Rawat
81
16
0
07 Apr 2023
Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval
Jae Myung Kim
A. Sophia Koepke
Cordelia Schmid
Zeynep Akata
130
30
0
06 Apr 2023
Natural Language Robot Programming: NLP integrated with autonomous robotic grasping
Muhammad Arshad Khan
Max Kenney
Jack Painter
Disha Kamale
Riza Batista-Navarro
Amir M. Ghalamzan-E.
LM&Ro
68
4
0
06 Apr 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
78
14
0
06 Apr 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
88
57
0
06 Apr 2023
What's in a Name? Beyond Class Indices for Image Recognition
Kai Han
Yandong Li
S. Vaze
Jie Li
Xuhui Jia
VLM
92
7
0
05 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM, AI4TS
100
6
0
04 Apr 2023
G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System
Lixia Wu
Jianlin Liu
Junhong Lou
Haoyuan Hu
Jianbin Zheng
Haomin Wen
Chao Song
Shu He
VLM
79
5
0
04 Apr 2023
Beyond Unimodal: Generalising Neural Processes for Multimodal Uncertainty Estimation
M. Jung
He Zhao
Joanna Dipnall
Lan Du
UQCV, BDL
72
8
0
04 Apr 2023
Multi-modal Fake News Detection on Social Media via Multi-grained Information Fusion
Yangming Zhou
Yuzhou Yang
Qichao Ying
Zhenxing Qian
Xinpeng Zhang
69
45
0
03 Apr 2023
Multi-Modal Representation Learning with Text-Driven Soft Masks
Jaeyoo Park
Bohyung Han
SSL
58
4
0
03 Apr 2023
Sketch-based Video Object Localization
Sangmin Woo
So-Yeong Jeon
Jinyoung Park
Minji Son
Sumin Lee
Changick Kim
120
0
0
02 Apr 2023
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun
Pengchuan Zhang
Peizhao Zhang
Hardik Shah
Kate Saenko
Xide Xia
VLM
109
22
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
127
50
0
31 Mar 2023
Zero-shot Referring Image Segmentation with Global-Local Context Features
S. Yu
Paul Hongsuck Seo
Jeany Son
94
53
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
130
9
0
30 Mar 2023
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Sifan Long
Zhen Zhao
Junkun Yuan
Zichang Tan
Jiangjiang Liu
Luping Zhou
Sheng-sheng Wang
Jingdong Wang
VLM
115
3
0
30 Mar 2023
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations
VS Vibashan
Ning Yu
Chen Xing
Can Qin
M. Gao
Juan Carlos Niebles
Vishal M. Patel
Ran Xu
VLM, ISeg
78
18
0
29 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM, VLM
131
25
0
29 Mar 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
138
169
0
28 Mar 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Chunpu Xu
Jing Li
VLM
62
5
0
27 Mar 2023
Curriculum Learning for Compositional Visual Reasoning
Wafa Aissa
Marin Ferecatu
M. Crucianu
LRM
87
3
0
27 Mar 2023
RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning
Yabin Zhu
Chenglong Li
Tianlin Li
Jin Tang
Zhixiang Huang
94
9
0
26 Mar 2023
Equivariant Similarity for Vision-Language Foundation Models
Tan Wang
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Zhengyuan Yang
Hanwang Zhang
Zicheng Liu
Lijuan Wang
CoGe
87
51
0
25 Mar 2023
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
Junjie Ke
Keren Ye
Jiahui Yu
Yonghui Wu
P. Milanfar
Feng Yang
VLM
104
61
0
24 Mar 2023
Accelerating Vision-Language Pretraining with Free Language Modeling
Teng Wang
Yixiao Ge
Feng Zheng
Ran Cheng
Ying Shan
Xiaohu Qie
Ping Luo
VLM, MLLM
123
10
0
24 Mar 2023