Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.07490
Cited By
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 1,513 papers shown
Title
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
85
1,265
0
04 May 2022
All You May Need for VQA are Image Captions
Soravit Changpinyo
Doron Kukliansky
Idan Szpektor
Xi Chen
Nan Ding
Radu Soricut
32
70
0
04 May 2022
Visual Commonsense in Pretrained Unimodal and Multimodal Models
Chenyu Zhang
Benjamin Van Durme
Zhuowan Li
Elias Stengel-Eskin
VLM
SSL
31
39
0
04 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
17
16
0
02 May 2022
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
Cheng Chen
Yudong Zhu
Zhenshan Tan
Qingrong Cheng
Xin Jiang
Qun Liu
X. Gu
31
39
0
01 May 2022
Visual Spatial Reasoning
Fangyu Liu
Guy Edward Toh Emerson
Nigel Collier
ReLM
61
161
0
30 Apr 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
53
3,381
0
29 Apr 2022
PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Yuting Gao
Jinfeng Liu
Zihan Xu
Jinchao Zhang
Ke Li
Rongrong Ji
Chunhua Shen
VLM
CLIP
34
101
0
29 Apr 2022
Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval
Siyu Ren
Kenny Q. Zhu
VLM
30
7
0
29 Apr 2022
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Spencer Whitehead
Suzanne Petryk
Vedaad Shakib
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
36
54
0
28 Apr 2022
Multi stain graph fusion for multimodal integration in pathology
Chaitanya Dwivedi
Shima Nofallah
Maryam Pouryahya
Janani Iyer
K. Leidal
...
T. Watkins
A. Billin
R. Myers
J. Abel
A. Behrooz
28
18
0
26 Apr 2022
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
Yida Zhao
Yuqing Song
Qin Jin
14
29
0
24 Apr 2022
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
Xiaojian Ma
Weili Nie
Zhiding Yu
Huaizu Jiang
Chaowei Xiao
Yuke Zhu
Song-Chun Zhu
Anima Anandkumar
ViT
LRM
35
19
0
24 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
26
8
0
23 Apr 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
59
2
0
22 Apr 2022
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Zhecan Wang
Noel Codella
Yen-Chun Chen
Luowei Zhou
Xiyang Dai
...
Jianwei Yang
Haoxuan You
Kai-Wei Chang
Shih-Fu Chang
Lu Yuan
VLM
OffRL
33
22
0
22 Apr 2022
Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
Mustafa Shukor
Guillaume Couairon
Asya Grechka
Matthieu Cord
ViT
49
18
0
20 Apr 2022
K-LITE: Learning Transferable Visual Models with External Knowledge
Sheng Shen
Chunyuan Li
Xiaowei Hu
Jianwei Yang
Yujia Xie
...
Ce Liu
Kurt Keutzer
Trevor Darrell
Anna Rohrbach
Jianfeng Gao
CLIP
VLM
36
83
0
20 Apr 2022
LitMC-BERT: transformer-based multi-label classification of biomedical literature with an application on COVID-19 literature curation
Qingyu Chen
Jingcheng Du
Alexis Allot
Zhiyong Lu
24
25
0
19 Apr 2022
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Yupan Huang
Tengchao Lv
Lei Cui
Yutong Lu
Furu Wei
37
437
0
18 Apr 2022
Learning to Execute Actions or Ask Clarification Questions
Zhengxiang Shi
Yue Feng
Aldo Lipani
LM&Ro
28
44
0
18 Apr 2022
Visio-Linguistic Brain Encoding
S. Oota
Jashn Arora
Vijay Rowtula
Manish Gupta
R. Bapi
AI4CE
22
15
0
18 Apr 2022
End-to-end Dense Video Captioning as Sequence Generation
Wanrong Zhu
Bo Pang
Ashish V. Thapliyal
William Yang Wang
Radu Soricut
DiffM
19
32
0
18 Apr 2022
Attention Mechanism based Cognition-level Scene Understanding
Xuejiao Tang
Tai Le Quy
LRM
35
0
0
17 Apr 2022
Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting
G. Han
Long Chen
Jiawei Ma
Shiyuan Huang
Ramalingam Chellappa
Shih-Fu Chang
VLM
37
20
0
16 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
22
43
0
16 Apr 2022
Vision-and-Language Pretrained Models: A Survey
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
38
63
0
15 Apr 2022
XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
Chan-Jan Hsu
Hung-yi Lee
Yu Tsao
VLM
32
3
0
15 Apr 2022
WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types
Xuwu Wang
Junfeng Tian
Min Gui
Zhixu Li
Rui Wang
Ming Yan
Lihan Chen
Yanghua Xiao
VGen
38
48
0
13 Apr 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Sanjay Subramanian
William Merrill
Trevor Darrell
Matt Gardner
Sameer Singh
Anna Rohrbach
ObjD
49
126
0
12 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD
MLLM
VLM
25
49
0
12 Apr 2022
XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation
Wei Liu
Fangyue Liu
Fei Din
Qian He
Zili Yi
VLM
31
38
0
11 Apr 2022
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
Shunyu Zhang
X. Jiang
Zequn Yang
T. Wan
Zengchang Qin
29
12
0
10 Apr 2022
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
Yunxing Kang
Tianqiao Liu
Hang Li
Y. Hao
Wenbiao Ding
43
7
0
10 Apr 2022
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
Jianan Wang
Guansong Lu
Hang Xu
Zhenguo Li
Chunjing Xu
Yanwei Fu
38
17
0
09 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
48
404
0
07 Apr 2022
Temporal Alignment Networks for Long-term Video
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
20
84
0
06 Apr 2022
Domain-Agnostic Prior for Transfer Semantic Segmentation
Xinyue Huo
Lingxi Xie
Hengtong Hu
Wen-gang Zhou
Houqiang Li
Qi Tian
37
29
0
06 Apr 2022
SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering
Vipul Gupta
Zhuowan Li
Adam Kortylewski
Chenyu Zhang
Yingwei Li
Alan Yuille
35
44
0
05 Apr 2022
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Roman Bachmann
David Mizrahi
Andrei Atanov
Amir Zamir
51
266
0
04 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
19
17
0
31 Mar 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
39
60
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
35
94
0
30 Mar 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
47
153
0
28 Mar 2022
Few-Shot Object Detection with Fully Cross-Transformer
G. Han
Jiawei Ma
Shiyuan Huang
Long Chen
Shih-Fu Chang
47
132
0
28 Mar 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP
VLM
36
16
0
27 Mar 2022
Deep Multi-modal Fusion of Image and Non-image Data in Disease Diagnosis and Prognosis: A Review
C. Cui
Haichun Yang
Yaohong Wang
Shilin Zhao
Zuhayr Asad
Lori A. Coburn
K. Wilson
Bennett A. Landman
Yuankai Huo
28
94
0
25 Mar 2022
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
Zhou Yu
Zitian Jin
Jun Yu
Mingliang Xu
Hongbo Wang
Jianping Fan
33
4
0
24 Mar 2022
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
Yanyuan Qiao
Yuankai Qi
Yicong Hong
Zheng Yu
Peifeng Wang
Qi Wu
AI4TS
32
71
0
22 Mar 2022
Graph-Text Multi-Modal Pre-training for Medical Representation Learning
Sungjin Park
Seongsu Bae
Jiho Kim
Tackeun Kim
Edward Choi
25
16
0
18 Mar 2022
Previous
1
2
3
...
18
19
20
...
29
30
31
Next