Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.07490
Cited By
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 1,509 papers shown
Title
Logically Consistent Loss for Visual Question Answering
Anh-Cat Le-Ngo
T. Tran
Santu Rana
Sunil R. Gupta
Svetha Venkatesh
OOD
24
0
0
19 Nov 2020
Using Text to Teach Image Retrieval
Haoyu Dong
Ze Wang
Qiang Qiu
Guillermo Sapiro
3DV
35
4
0
19 Nov 2020
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Hassan Akbari
Hamid Palangi
Jianwei Yang
Sudha Rao
Asli Celikyilmaz
Roland Fernandez
P. Smolensky
Jianfeng Gao
Shih-Fu Chang
32
3
0
18 Nov 2020
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Andreas Veit
Kimberly Wilber
20
2
0
17 Nov 2020
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
Jianan Wang
Boyang Albert Li
Xiangyu Fan
Jing-Hua Lin
Yanwei Fu
25
2
0
15 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
49
417
0
14 Nov 2020
Cross-Modality Protein Embedding for Compound-Protein Affinity and Contact Prediction
Yuning You
Yang Shen
25
8
0
14 Nov 2020
Transductive Zero-Shot Learning using Cross-Modal CycleGAN
Patrick Bordes
Éloi Zablocki
Benjamin Piwowarski
Patrick Gallinari
VLM
27
0
0
13 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
21
81
0
10 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
21
94
0
10 Nov 2020
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
Ece Takmaz
Mario Giulianelli
Sandro Pezzelle
Arabella J. Sinclair
Raquel Fernández
20
26
0
09 Nov 2020
CapWAP: Captioning with a Purpose
Adam Fisch
Kenton Lee
Ming-Wei Chang
J. Clark
Regina Barzilay
8
11
0
09 Nov 2020
Multi-modal, multi-task, multi-attention (M3) deep learning detection of reticular pseudodrusen: towards automated and accessible classification of age-related macular degeneration
Qingyu Chen
T. Keenan
Alexis Allot
Yifan Peng
Elvira Agrón
...
Chantal Cousineau-Krieger
W. Wong
Yingying Zhu
E. Chew
Zhiyong Lu
MedIm
33
20
0
09 Nov 2020
Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay
Mostafa Dehghani
Samira Abnar
Songlin Yang
Dara Bahri
Philip Pham
J. Rao
Liu Yang
Sebastian Ruder
Donald Metzler
53
696
0
08 Nov 2020
Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles
Christopher Clark
Mark Yatskar
Luke Zettlemoyer
26
61
0
07 Nov 2020
Utilizing Every Image Object for Semi-supervised Phrase Grounding
Haidong Zhu
Arka Sadhu
Zhao-Heng Zheng
Ram Nevatia
ObjD
25
7
0
05 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
Yue Wang
Jing Li
M. Lyu
Irwin King
19
16
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
31
168
0
01 Nov 2020
Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a Class-imbalance View
Yangyang Guo
Liqiang Nie
Zhiyong Cheng
Q. Tian
Min Zhang
19
69
0
30 Oct 2020
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Stanislav Frolov
Shailza Jolly
Jörn Hees
Andreas Dengel
EGVM
20
5
0
28 Oct 2020
Co-attentional Transformers for Story-Based Video Understanding
Björn Bebensee
Byoung-Tak Zhang
22
5
0
27 Oct 2020
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Aisha Urooj Khan
Amir Mazaheri
N. Lobo
M. Shah
32
56
0
27 Oct 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li
Haoxuan You
Zhecan Wang
Alireza Zareian
Shih-Fu Chang
Kai-Wei Chang
SSL
VLM
72
12
0
24 Oct 2020
Can images help recognize entities? A study of the role of images for Multimodal NER
Shuguang Chen
Gustavo Aguilar
Leonardo Neves
Thamar Solorio
EgoV
45
33
0
23 Oct 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
Itai Gat
Idan Schwartz
Alex Schwing
Tamir Hazan
60
90
0
21 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
25
6
0
19 Oct 2020
Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering
Hantao Huang
Tao Han
Wei Han
D. Yap
Cheng-Ming Chiang
21
2
0
17 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
24
21
0
16 Oct 2020
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović
Chandra Bhagavatula
J. S. Park
Ronan Le Bras
Noah A. Smith
Yejin Choi
ReLM
LRM
18
62
0
15 Oct 2020
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan
Joey Tianyi Zhou
CLIP
22
120
0
14 Oct 2020
Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
Jack Hessel
Lillian Lee
29
72
0
13 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Fuli Luo
Pengcheng Yang
Shicheng Li
Xuancheng Ren
Xu Sun
VLM
SSL
21
16
0
13 Oct 2020
Contrast and Classify: Training Robust VQA Models
Yash Kant
A. Moudgil
Dhruv Batra
Devi Parikh
Harsh Agrawal
21
5
0
13 Oct 2020
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding
Qinxin Wang
Hao Tan
Sheng Shen
Michael W. Mahoney
Z. Yao
ObjD
50
11
0
12 Oct 2020
Beyond Language: Learning Commonsense from Images for Reasoning
Wanqing Cui
Yanyan Lan
Liang Pang
Jiafeng Guo
Xueqi Cheng
LRM
29
5
0
10 Oct 2020
Interpretable Neural Computation for Real-World Compositional Visual Question Answering
Ruixue Tang
Chao Ma
CoGe
19
2
0
10 Oct 2020
ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization
Tzuf Paz-Argaman
Yuval Atzmon
Gal Chechik
Reut Tsarfaty
VLM
32
32
0
07 Oct 2020
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
22
244
0
06 Oct 2020
Pathological Visual Question Answering
Xuehai He
Zhuo Cai
Wenlan Wei
Yichen Zhang
Luntian Mou
Eric Xing
P. Xie
75
24
0
06 Oct 2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering
M. Farazi
Salman Khan
Nick Barnes
19
2
0
05 Oct 2020
Multi-Modal Open-Domain Dialogue
Kurt Shuster
Eric Michael Smith
Da Ju
Jason Weston
AI4CE
41
42
0
02 Oct 2020
Which *BERT? A Survey Organizing Contextualized Encoders
Patrick Xia
Shijie Wu
Benjamin Van Durme
26
50
0
02 Oct 2020
Contrastive Learning of Medical Visual Representations from Paired Images and Text
Yuhao Zhang
Hang Jiang
Yasuhide Miura
Christopher D. Manning
C. Langlotz
MedIm
61
733
0
02 Oct 2020
ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
José Manuél Gómez-Pérez
Raúl Ortega
43
24
0
01 Oct 2020
Learning Object Detection from Captions via Textual Scene Attributes
Achiya Jerbi
Roei Herzig
Jonathan Berant
Gal Chechik
Amir Globerson
27
21
0
30 Sep 2020
Attention that does not Explain Away
Nan Ding
Xinjie Fan
Zhenzhong Lan
Dale Schuurmans
Radu Soricut
27
3
0
29 Sep 2020
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
Xiaowei Hu
Xi Yin
Kevin Qinghong Lin
Lijuan Wang
Lefei Zhang
Jianfeng Gao
Zicheng Liu
VLM
22
56
0
28 Sep 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
30
102
0
23 Sep 2020
Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering
Tuong Khanh Long Do
Binh X. Nguyen
Huy Tran
Erman Tjiputra
Quang-Dieu Tran
Thanh-Toan Do
23
2
0
23 Sep 2020
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
OOD
22
140
0
18 Sep 2020
Previous
1
2
3
...
27
28
29
30
31
Next