Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2006.16934
Cited By
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
30 June 2020
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph"
50 / 208 papers shown
Title
Multimodal Graph Transformer for Multimodal Question Answering
Xuehai He
Xin Eric Wang
36
7
0
30 Apr 2023
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining
Bingqian Lin
Zicong Chen
Mingjie Li
Haokun Lin
Hang Xu
...
Ling-Hao Chen
Xiaojun Chang
Yi Yang
L. Xing
Xiaodan Liang
LM&MA
MedIm
AI4CE
40
14
0
26 Apr 2023
Rethinking Benchmarks for Cross-modal Image-text Retrieval
Wei Chen
Linli Yao
Qin Jin
VLM
16
18
0
21 Apr 2023
Learning Situation Hyper-Graphs for Video Question Answering
Aisha Urooj Khan
Hilde Kuehne
Bo Wu
Kim Chheu
Walid Bousselham
Chuang Gan
N. Lobo
M. Shah
34
15
0
18 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
16
1
0
10 Apr 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
38
51
0
06 Apr 2023
G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System
Lixia Wu
Jianlin Liu
Junhong Lou
Haoyuan Hu
Jianbin Zheng
Haomin Wen
Chao Song
Shu He
VLM
25
4
0
04 Apr 2023
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Sifan Long
Zhen Zhao
Junkun Yuan
Zichang Tan
Jiangjiang Liu
Luping Zhou
Sheng-sheng Wang
Jingdong Wang
VLM
25
2
0
30 Mar 2023
Just Noticeable Visual Redundancy Forecasting: A Deep Multimodal-driven Approach
Wuyuan Xie
Shukang Wang
Sukun Tian
Lirong Huang
Ye Liu
Miaohui Wang
11
3
0
18 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
31
202
0
20 Feb 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Zhihong Chen
Shizhe Diao
Benyou Wang
Guanbin Li
Xiang Wan
MedIm
22
29
0
17 Feb 2023
VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
Yansong Gong
Georgina Cosma
Axel Finke
ViT
30
2
0
13 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
31
26
0
01 Feb 2023
Learning the Effects of Physical Actions in a Multi-modal Environment
Gautier Dagan
Frank Keller
A. Lascarides
LM&Ro
32
3
0
27 Jan 2023
HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images
Kun Li
G. Vosselman
M. Yang
25
5
0
23 Jan 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Floris Weers
Vaishaal Shankar
Angelos Katharopoulos
Yinfei Yang
Tom Gunter
CLIP
23
4
0
19 Jan 2023
MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology
Chaoyi Wu
Xiaoman Zhang
Ya-Qin Zhang
Yanfeng Wang
Weidi Xie
LM&MA
VLM
30
109
0
05 Jan 2023
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Rohan Pandey
Rulin Shao
Paul Pu Liang
Ruslan Salakhutdinov
Louis-Philippe Morency
26
12
0
20 Dec 2022
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Siyi Liu
Yaoyuan Liang
Feng Li
Shijia Huang
Hao Zhang
Hang Su
Jun Zhu
Lei Zhang
ObjD
50
25
0
28 Nov 2022
Deep representation learning: Fundamentals, Perspectives, Applications, and Open Challenges
K. T. Baghaei
Amirreza Payandeh
Pooya Fayyazsanavi
Shahram Rahimi
Zhiqian Chen
Somayeh Bakhtiari Ramezani
FaML
AI4TS
32
6
0
27 Nov 2022
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
41
13
0
19 Nov 2022
Text-Aware Dual Routing Network for Visual Question Answering
Luoqian Jiang
Yifan He
Jian Chen
21
0
0
17 Nov 2022
YORO -- Lightweight End to End Visual Grounding
Chih-Hui Ho
Srikar Appalaraju
Bhavan A. Jasani
R. Manmatha
Nuno Vasconcelos
ObjD
21
21
0
15 Nov 2022
A Survey of Knowledge Enhanced Pre-trained Language Models
Linmei Hu
Zeyi Liu
Ziwang Zhao
Lei Hou
Liqiang Nie
Juanzi Li
KELM
VLM
24
121
0
11 Nov 2022
Masked Vision-Language Transformers for Scene Text Recognition
Jie Wu
Ying Peng
Shenmin Zhang
Weigang Qi
Jian Zhang
32
3
0
09 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua-Hong Wu
Haifeng Wang
VLM
21
1
0
07 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
75
12
0
28 Oct 2022
UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance
Wei Li
Xue Xu
Xinyan Xiao
Jiacheng Liu
Hu Yang
...
Zhanpeng Wang
Zhifan Feng
Qiaoqiao She
Yajuan Lyu
Hua-Hong Wu
121
29
0
28 Oct 2022
Learning Joint Representation of Human Motion and Language
Jihoon Kim
Youngjae Yu
Seungyoung Shin
Taehyun Byun
Sungjoon Choi
23
5
0
27 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
T. Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
16
6
0
24 Oct 2022
Video based Object 6D Pose Estimation using Transformers
Apoorva Beedu
Huda AlAmri
Irfan Essa
ViT
16
8
0
24 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
27
6
0
24 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLM
CLIP
27
43
0
17 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM
VLM
29
62
0
14 Oct 2022
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
Qiming Peng
Yinxu Pan
Wenjin Wang
Bin Luo
Zhenyu Zhang
...
Shi Feng
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
13
83
0
12 Oct 2022
Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features
Gokul Karthik Kumar
Karthik Nandakumar
VLM
CLIP
27
56
0
12 Oct 2022
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach
Georgios Tziafas
H. Kasaei
LM&Ro
20
3
0
03 Oct 2022
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan
Weichong Yin
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
VLM
22
19
0
30 Sep 2022
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Zhihong Chen
Guanbin Li
Xiang Wan
124
65
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
26
8
0
14 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
24
27
0
29 Aug 2022
Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Chenhao Cui
Xinnian Liang
Shuangzhi Wu
Zhoujun Li
41
3
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
38
0
0
23 Aug 2022
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
34
13
0
17 Aug 2022
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning
Adam Dahlgren Lindström
Savitha Sam Abraham
14
47
0
10 Aug 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li
Xin He
Longhui Wei
Long Qian
Linchao Zhu
Lingxi Xie
Yueting Zhuang
Qi Tian
Siliang Tang
VLM
35
79
0
04 Aug 2022
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
Mengxue Qu
Yu Wu
Wu Liu
Qiqi Gong
Xiaodan Liang
Olga Russakovsky
Yao Zhao
Yunchao Wei
ObjD
11
22
0
27 Jul 2022
LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection
Zhuo Chen
Yufen Huang
Jiaoyan Chen
Yuxia Geng
Yin Fang
Jeff Z. Pan
Ningyu Zhang
Wen Zhang
26
35
0
26 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
25
269
0
15 Jul 2022
Previous
1
2
3
4
5
Next