Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.07490
Cited By
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 1,513 papers shown
Title
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning
Q. Si
Yuanxin Liu
Fandong Meng
Zheng Lin
Peng Fu
Yanan Cao
Weiping Wang
Jie Zhou
46
23
0
10 Oct 2022
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Zijia Zhao
Longteng Guo
Xingjian He
Shuai Shao
Zehuan Yuan
Jing Liu
21
9
0
09 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
34
21
0
09 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
29
2
0
08 Oct 2022
Retrieval Augmented Visual Question Answering with Outside Knowledge
Weizhe Lin
Bill Byrne
RALM
74
71
0
07 Oct 2022
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Su Wang
Jing Yu Koh
Alexander Ku
Austin Waters
Yinfei Yang
Jason Baldridge
Zarana Parekh
LM&Ro
27
45
0
06 Oct 2022
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training
Tianyu Huang
Bowen Dong
Yunhan Yang
Xiaoshui Huang
Rynson W. H. Lau
Wanli Ouyang
W. Zuo
VLM
3DPC
CLIP
44
144
0
03 Oct 2022
A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering
Xiaofei Huang
Hongfang Gong
MedIm
66
12
0
01 Oct 2022
Domain-Unified Prompt Representations for Source-Free Domain Generalization
Hongjing Niu
Hanting Li
Feng Zhao
Bin Li
VLM
74
18
0
29 Sep 2022
Domain-aware Self-supervised Pre-training for Label-Efficient Meme Analysis
Shivam Sharma
Mohd Khizir Siddiqui
Md. Shad Akhtar
Tanmoy Chakraborty
SSL
36
5
0
29 Sep 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
63
28
0
28 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
30
25
0
28 Sep 2022
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
33
33
0
27 Sep 2022
RepsNet: Combining Vision with Language for Automated Medical Reports
A. Tanwani
Joelle Barral
Daniel Freedman
MedIm
53
20
0
27 Sep 2022
LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation
Yue Zhang
Parisa Kordjamshidi
40
11
0
26 Sep 2022
Visual representations in the human brain are aligned with large language models
Adrien Doerig
Tim C Kietzmann
Emily J. Allen
Yihan Wu
Thomas Naselaris
Kendrick Norris Kay
I. Charest
45
23
0
23 Sep 2022
Unsupervised Hashing with Semantic Concept Mining
Rong-Cheng Tu
Xian-Ling Mao
Kevin Qinghong Lin
Chengfei Cai
Weize Qin
Hongfa Wang
Wei Wei
Heyan Huang
65
10
0
23 Sep 2022
LGDN: Language-Guided Denoising Network for Video-Language Modeling
Haoyu Lu
Mingyu Ding
Nanyi Fei
Yuqi Huo
Zhiwu Lu
VLM
91
16
0
23 Sep 2022
The Ability of Image-Language Explainable Models to Resemble Domain Expertise
P. Werner
Anna Zapaishchykova
Ujjwal Ratan
56
2
0
19 Sep 2022
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Lovisa Hagström
Richard Johansson
VLM
43
4
0
19 Sep 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
38
149
0
15 Sep 2022
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Zhihong Chen
Guanbin Li
Xiang Wan
127
66
0
15 Sep 2022
Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training
Zhihong Chen
Yu Du
Jinpeng Hu
Yang Liu
Guanbin Li
Xiang Wan
Tsung-Hui Chang
97
111
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
31
8
0
14 Sep 2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Tianlin Li
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
37
691
0
14 Sep 2022
ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness Mining
Zhexiong Liu
M. Guo
Y. Dai
Diane Litman
32
15
0
14 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
145
29
0
12 Sep 2022
VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models
Felix Vogel
Nina Shvetsova
Leonid Karlinsky
Hilde Kuehne
VLM
63
7
0
12 Sep 2022
Instruction-driven history-aware policies for robotic manipulations
Pierre-Louis Guhur
Shizhe Chen
Ricardo Garcia Pinel
Makarand Tapaswi
Ivan Laptev
Cordelia Schmid
LM&Ro
113
102
0
11 Sep 2022
Pre-training image-language transformers for open-vocabulary tasks
A. Piergiovanni
Weicheng Kuo
A. Angelova
VLM
ViT
47
9
0
09 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
26
63
0
07 Sep 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
35
64
0
04 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
26
27
0
29 Aug 2022
Disentangle and Remerge: Interventional Knowledge Distillation for Few-Shot Object Detection from A Conditional Causal Perspective
Jiangmeng Li
Yanan Zhang
Jingyao Wang
Hui Xiong
Chengbo Jiao
Xiaohui Hu
Changwen Zheng
Gang Hua
CML
55
28
0
26 Aug 2022
AiM: Taking Answers in Mind to Correct Chinese Cloze Tests in Educational Applications
Yusen Zhang
Zhongli Li
Qingyu Zhou
Ziyi Liu
Chao Li
Mina W. Ma
Yunbo Cao
Hongzhi Liu
11
1
0
26 Aug 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
59
160
0
25 Aug 2022
Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Chenhao Cui
Xinnian Liang
Shuangzhi Wu
Zhoujun Li
44
3
0
24 Aug 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
Yanbei Chen
Massimiliano Mancini
Xiatian Zhu
Zeynep Akata
52
115
0
24 Aug 2022
FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang
A. Mahjoubfar
Anupama Joshi
29
4
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
41
0
0
23 Aug 2022
Revising Image-Text Retrieval via Multi-Modal Entailment
Xu Yan
Chunhui Ai
Ziqiang Cao
Min Cao
Sujian Li
Wen-Yi Chen
Guohong Fu
33
0
0
22 Aug 2022
SPOT: Knowledge-Enhanced Language Representations for Information Extraction
Jiacheng Li
Yannis Katsis
Tyler Baldwin
Ho-Cheol Kim
Andrew Bartko
Julian McAuley
Chun-Nan Hsu
32
15
0
20 Aug 2022
Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork
Xin Yuan
Zhe Lin
Jason Kuen
Jianming Zhang
John Collomosse
42
5
0
17 Aug 2022
What Artificial Neural Networks Can Tell Us About Human Language Acquisition
Alex Warstadt
Samuel R. Bowman
32
112
0
17 Aug 2022
MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting
Pengyu Chen
Junyuan Gao
Yuan. Yuan
Qi. Wang
27
6
0
14 Aug 2022
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2
Xinghui Zhou
Xin Jin
Jianwen Lv
Heng Huang
Ming Mao
Shuai Cui
CoGe
21
0
0
09 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
23
11
0
08 Aug 2022
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang
Feiya Lv
Ting Yao
Yiming Yuan
Jin Ma
Yu Luo
Haijin Liang
31
3
0
05 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM
LRM
VPVLM
37
30
0
04 Aug 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
Juncheng Li
Xin He
Longhui Wei
Long Qian
Linchao Zhu
Lingxi Xie
Yueting Zhuang
Qi Tian
Siliang Tang
VLM
41
79
0
04 Aug 2022
Previous
1
2
3
...
15
16
17
...
29
30
31
Next