Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.07490
Cited By
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 1,512 papers shown
Title
Focusing On Targets For Improving Weakly Supervised Visual Grounding
V. Pham
Nao Mishima
ObjD
31
1
0
22 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
53
205
0
20 Feb 2023
Interactive Video Corpus Moment Retrieval using Reinforcement Learning
Zhixin Ma
Chong-Wah Ngo
35
3
0
19 Feb 2023
VLN-Trans: Translator for the Vision and Language Navigation Agent
Yue Zhang
Parisa Kordjamshidi
42
16
0
18 Feb 2023
CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
Zhi Zhang
H. Yannakoudakis
Xiantong Zhen
Ekaterina Shutova
29
2
0
17 Feb 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Zhihong Chen
Shizhe Diao
Benyou Wang
Guanbin Li
Xiang Wan
MedIm
37
30
0
17 Feb 2023
Retrieval-augmented Image Captioning
R. Ramos
Desmond Elliott
Bruno Martins
VLM
36
29
0
16 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
39
7
0
16 Feb 2023
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
Jiang Liu
Hui Ding
Zhaowei Cai
Yuting Zhang
R. Satzoda
Vijay Mahadevan
R. Manmatha
ObjD
23
122
0
14 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
42
40
0
14 Feb 2023
Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data
Ryumei Nakada
Halil Ibrahim Gulluk
Zhun Deng
Wenlong Ji
James Zou
Linjun Zhang
SSL
VLM
42
37
0
13 Feb 2023
HateProof: Are Hateful Meme Detection Systems really Robust?
Piush Aggarwal
Pranit Chawla
Mithun Das
Punyajoy Saha
Binny Mathew
Torsten Zesch
Animesh Mukherjee
AAML
39
8
0
11 Feb 2023
SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images
Gary Y. Li
Junyu Chen
Se-In Jang
Kuang Gong
Quanzheng Li
ViT
MedIm
51
14
0
08 Feb 2023
Self-supervised Multi-view Disentanglement for Expansion of Visual Collections
Nihal Jain
Praneetha Vaddamanu
Paridhi Maheshwari
Vishwa Vinay
K. Kulkarni
22
3
0
04 Feb 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Fan Liu
Liqiang Nie
Mohan S. Kankanhalli
40
10
0
04 Feb 2023
Controlling for Stereotypes in Multimodal Language Model Evaluation
Manuj Malik
Richard Johansson
36
1
0
03 Feb 2023
CLIPood: Generalizing CLIP to Out-of-Distributions
Yang Shu
Xingzhuo Guo
Jialong Wu
Ximei Wang
Jianmin Wang
Mingsheng Long
OODD
VLM
59
73
0
02 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
46
161
0
01 Feb 2023
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
42
26
0
01 Feb 2023
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
Dachuan Shi
Chaofan Tao
Ying Jin
Zhendong Yang
Chun Yuan
Jiaqi Wang
VLM
ViT
30
38
0
31 Jan 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
322
4,300
0
30 Jan 2023
Debiased Fine-Tuning for Vision-language Models by Prompt Regularization
Beier Zhu
Yulei Niu
Saeil Lee
Minhoe Hur
Hanwang Zhang
VLM
VPVLM
32
22
0
29 Jan 2023
Reading and Reasoning over Chart Images for Evidence-based Automated Fact-Checking
Mubashara Akhtar
O. Cocarascu
Elena Simperl
26
25
0
27 Jan 2023
Lexi: Self-Supervised Learning of the UI Language
Pratyay Banerjee
Shweti Mahajan
Kushal Arora
Chitta Baral
Oriana Riva
36
17
0
23 Jan 2023
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Razvan-George Pasca
Alexey Gavryushin
Muhammad Hamza
Yen-Ling Kuo
Kaichun Mo
Luc Van Gool
Otmar Hilliges
Xi Wang
38
14
0
22 Jan 2023
MultiNet with Transformers: A Model for Cancer Diagnosis Using Images
H. Barzekar
Yash J. Patel
L. Tong
Zeyun Yu
MedIm
32
6
0
21 Jan 2023
Screen Correspondence: Mapping Interchangeable Elements between UIs
Jason Wu
Amanda Swearngin
Xiaoyi Zhang
Jeffrey Nichols
Jeffrey P. Bigham
48
7
0
20 Jan 2023
A Survey of research in Deep Learning for Robotics for Undergraduate research interns
P. NarayananP.
Palacode Narayana Iyer Anantharaman
6
1
0
19 Jan 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Floris Weers
Vaishaal Shankar
Angelos Katharopoulos
Yinfei Yang
Tom Gunter
CLIP
25
4
0
19 Jan 2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
Xiaofeng Yang
Fayao Liu
Guosheng Lin
VLM
26
7
0
18 Jan 2023
Curriculum Script Distillation for Multilingual Visual Question Answering
Khyathi Raghavi Chandu
A. Geramifard
35
0
0
17 Jan 2023
It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers
Ana-Maria Bucur
Adrian Cosma
Paolo Rosso
Liviu P. Dinu
46
34
0
13 Jan 2023
Text to Point Cloud Localization with Relation-Enhanced Transformer
Guangzhi Wang
Hehe Fan
Mohan S. Kankanhalli
3DPC
36
15
0
13 Jan 2023
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Zhenfang Chen
Qinhong Zhou
Songlin Yang
Yining Hong
Hao Zhang
Chuang Gan
LRM
VLM
44
36
0
12 Jan 2023
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
Xinsong Zhang
Yan Zeng
Jipeng Zhang
Hang Li
VLM
AI4CE
LRM
37
17
0
12 Jan 2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
Paul Lerner
O. Ferret
C. Guinaudeau
29
9
0
11 Jan 2023
Universal Multimodal Representation for Language Understanding
Zhuosheng Zhang
Kehai Chen
Rui Wang
Masao Utiyama
Eiichiro Sumita
Z. Li
Hai Zhao
SSL
24
21
0
09 Jan 2023
MAQA: A Multimodal QA Benchmark for Negation
Judith Yue Li
Aren Jansen
Qingqing Huang
Joonseok Lee
Ravi Ganti
Dima Kuzmin
33
5
0
09 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
45
7
0
05 Jan 2023
SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph
Yuxing Long
Binyuan Hui
Fulong Ye
Yanyang Li
Zhuoxin Han
Caixia Yuan
Yongbin Li
Xiaojie Wang
LLMAG
40
7
0
05 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
37
15
0
05 Jan 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM
AI4TS
188
70
0
30 Dec 2022
On Transforming Reinforcement Learning by Transformer: The Development Trajectory
Shengchao Hu
Li Shen
Ya Zhang
Yixin Chen
Dacheng Tao
OffRL
35
25
0
29 Dec 2022
Prototype-guided Cross-task Knowledge Distillation for Large-scale Models
Deng Li
Aming Wu
Yahong Han
Qingwen Tian
VLM
35
2
0
26 Dec 2022
Do DALL-E and Flamingo Understand Each Other?
Hang Li
Jindong Gu
Rajat Koner
Sahand Sharifzadeh
Volker Tresp
MLLM
23
12
0
23 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
21
241
0
21 Dec 2022
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Jiaxian Guo
Junnan Li
Dongxu Li
A. M. H. Tiong
Boyang Albert Li
Dacheng Tao
Steven C. H. Hoi
VLM
MLLM
37
108
0
21 Dec 2022
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Zhiyang Xu
Ying Shen
Lifu Huang
MLLM
34
110
0
21 Dec 2022
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Rohan Pandey
Rulin Shao
Paul Pu Liang
Ruslan Salakhutdinov
Louis-Philippe Morency
34
12
0
20 Dec 2022
Efficient Image Captioning for Edge Devices
Ning Wang
Jiangrong Xie
Hangzai Luo
Qinglin Cheng
Jihao Wu
Mingbo Jia
Linlin Li
VLM
CLIP
30
20
0
18 Dec 2022
Previous
1
2
3
...
12
13
14
...
29
30
31
Next