1908.02265
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,094 papers shown
Title
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation
Xingning Dong
Tian Gan
Xuemeng Song
Jianlong Wu
Yuan Cheng
Liqiang Nie
29
92
0
18 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
34
74
0
18 Mar 2022
Finding Structural Knowledge in Multimodal-BERT
Victor Milewski
Miryam de Lhoneux
Marie-Francine Moens
29
9
0
17 Mar 2022
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
Yang Ding
Jing Yu
Bangchang Liu
Yue Hu
Mingxin Cui
Qi Wu
15
63
0
17 Mar 2022
ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps
Jizhou Huang
Haifeng Wang
Yibo Sun
Yunsheng Shi
Zhengjie Huang
An Zhuo
Shikun Feng
33
45
0
17 Mar 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
19
21
0
17 Mar 2022
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
Luyang Huang
Guocheng Niu
Jiachen Liu
Xinyan Xiao
Hua Wu
VLM
CoGe
19
7
0
17 Mar 2022
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Haojun Jiang
Yuanze Lin
Dongchen Han
Shiji Song
Gao Huang
ObjD
51
51
0
16 Mar 2022
Modular and Parameter-Efficient Multimodal Fusion with Prompting
Sheng Liang
Mengjie Zhao
Hinrich Schütze
38
42
0
15 Mar 2022
Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs
Taichi Iki
Akiko Aizawa
LLMAG
21
6
0
15 Mar 2022
CARETS: A Consistency And Robustness Evaluative Test Suite for VQA
Carlos E. Jimenez
Olga Russakovsky
Karthik Narasimhan
CoGe
36
14
0
15 Mar 2022
Contrastive Visual Semantic Pretraining Magnifies the Semantics of Natural Language Representations
Robert Wolfe
Aylin Caliskan
VLM
25
13
0
14 Mar 2022
All in One: Exploring Unified Video-Language Pre-training
Alex Jinpeng Wang
Yixiao Ge
Rui Yan
Yuying Ge
Xudong Lin
Guanyu Cai
Jianping Wu
Ying Shan
Xiaohu Qie
Mike Zheng Shou
43
200
0
14 Mar 2022
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Haoyu Song
Li Dong
Weinan Zhang
Ting Liu
Furu Wei
VLM
CLIP
33
137
0
14 Mar 2022
HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing
Yanzhao Zheng
Haibin Wang
B. Dong
Xingjun Wang
Changshan Li
35
32
0
14 Mar 2022
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
Wenliang Dai
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
Pascale Fung
VLM
29
90
0
12 Mar 2022
Differentiated Relevances Embedding for Group-based Referring Expression Comprehension
Fuhai Chen
Xuri Ge
Xiaoshuai Sun
Yue Gao
Jianzhuang Liu
Feiyue Huang
Rongrong Ji
32
0
0
12 Mar 2022
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Tianlong Chen
Zhenyu Zhang
Yu Cheng
Ahmed Hassan Awadallah
Zhangyang Wang
ViT
46
37
0
12 Mar 2022
REX: Reasoning-aware and Grounded Explanation
Shi Chen
Qi Zhao
30
18
0
11 Mar 2022
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
Jie Lei
Xinlei Chen
Ning Zhang
Meng-xing Wang
Joey Tianyi Zhou
Tamara L. Berg
Licheng Yu
39
12
0
10 Mar 2022
Cross-modal Map Learning for Vision and Language Navigation
G. Georgakis
Karl Schmeckpeper
Karan Wanchoo
Soham Dan
E. Miltsakaki
Dan Roth
Kostas Daniilidis
27
64
0
10 Mar 2022
Towards Inadequately Pre-trained Models in Transfer Learning
Andong Deng
Xingjian Li
Di Hu
Tianyang Wang
Haoyi Xiong
Chengzhong Xu
19
6
0
09 Mar 2022
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
Xiwen Liang
Fengda Zhu
Lingling Li
Hang Xu
Xiaodan Liang
LM&Ro
VLM
36
29
0
08 Mar 2022
Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting
Chuhui Xue
Wenqing Zhang
Yu Hao
Shijian Lu
Philip Torr
Song Bai
VLM
45
32
0
08 Mar 2022
Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval
Jun Rao
Fei Wang
Liang Ding
Shuhan Qi
Yibing Zhan
Weifeng Liu
Dacheng Tao
OOD
47
28
0
08 Mar 2022
Image Search with Text Feedback by Additive Attention Compositional Learning
Yuxin Tian
Shawn D. Newsam
K. Boakye
CoGe
32
11
0
08 Mar 2022
Modeling Coreference Relations in Visual Dialog
Mingxiao Li
Marie-Francine Moens
19
9
0
06 Mar 2022
Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding
Daizong Liu
Xiang Fang
Wei Hu
Pan Zhou
32
37
0
06 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
24
37
0
03 Mar 2022
LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives
Danial Maleki
H. R Tizhoosh
MedIm
22
10
0
02 Mar 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
19
29
0
02 Mar 2022
Recent, rapid advancement in visual question answering architecture: a review
V. Kodali
Daniel Berleant
47
9
0
02 Mar 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Mingyang Zhou
Licheng Yu
Amanpreet Singh
Mengjiao MJ Wang
Zhou Yu
Ning Zhang
VLM
35
31
0
01 Mar 2022
Multi-modal Alignment using Representation Codebook
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
36
66
0
28 Feb 2022
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding
Jiapeng Wang
Lianwen Jin
Kai Ding
VLM
35
140
0
28 Feb 2022
SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following
Ruinian Xu
Hongyi Chen
Yunzhi Lin
Patricio A. Vela
27
6
0
25 Feb 2022
Joint Answering and Explanation for Visual Commonsense Reasoning
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Yin-wei Wei
Liqiang Nie
Mohan S. Kankanhalli
32
16
0
25 Feb 2022
Measuring CLEVRness: Blackbox testing of Visual Reasoning Models
Spyridon Mouselinos
Henryk Michalewski
Mateusz Malinowski
32
3
0
24 Feb 2022
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
LM&Ro
36
139
0
23 Feb 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Jiarui Xu
Shalini De Mello
Sifei Liu
Wonmin Byeon
Thomas Breuel
Jan Kautz
Xinyu Wang
ViT
VLM
200
504
0
22 Feb 2022
COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems
Shuang Ma
Sai H. Vemprala
Wenshan Wang
Jayesh K. Gupta
Yale Song
Daniel J. McDuff
Ashish Kapoor
SSL
37
9
0
20 Feb 2022
A Survey of Vision-Language Pre-Trained Models
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
47
180
0
18 Feb 2022
AMS_ADRN at SemEval-2022 Task 5: A Suitable Image-text Multimodal Joint Modeling Method for Multi-task Misogyny Identification
Da Li
Ming Yi
Yukai He
11
1
0
18 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
82
213
0
18 Feb 2022
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Oana Ignat
Santiago Castro
Yuhang Zhou
Jiajun Bao
Dandan Shan
Rada Mihalcea
24
3
0
16 Feb 2022
XFBoost: Improving Text Generation with Controllable Decoders
Xiangyu Peng
Michael Sollami
30
1
0
16 Feb 2022
Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Youwei Liang
Chongjian Ge
Zhan Tong
Yibing Song
Jue Wang
P. Xie
ViT
25
238
0
16 Feb 2022
Privacy Preserving Visual Question Answering
Cristian-Paul Bara
Q. Ping
Abhinav Mathur
Govind Thattai
M. Rohith
Gaurav Sukhatme
9
1
0
15 Feb 2022
ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer
Kohei Uehara
Yusuke Mori
Yusuke Mukuta
Tatsuya Harada
35
6
0
15 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
33
39
0
15 Feb 2022