ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.

LXMERT: Learning Cross-Modality Encoder Representations from Transformers (arXiv 1908.07490)

20 August 2019
Hao Tan
Mohit Bansal
    VLM
    MLLM

Papers citing "LXMERT: Learning Cross-Modality Encoder Representations from Transformers"

50 / 1,509 papers shown
Logically Consistent Loss for Visual Question Answering
Anh-Cat Le-Ngo
T. Tran
Santu Rana
Sunil R. Gupta
Svetha Venkatesh
OOD
24
0
0
19 Nov 2020
Using Text to Teach Image Retrieval
Haoyu Dong
Ze Wang
Qiang Qiu
Guillermo Sapiro
3DV
35
4
0
19 Nov 2020
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Hassan Akbari
Hamid Palangi
Jianwei Yang
Sudha Rao
Asli Celikyilmaz
Roland Fernandez
P. Smolensky
Jianfeng Gao
Shih-Fu Chang
32
3
0
18 Nov 2020
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Andreas Veit
Kimberly Wilber
20
2
0
17 Nov 2020
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
Jianan Wang
Boyang Albert Li
Xiangyu Fan
Jing-Hua Lin
Yanwei Fu
25
2
0
15 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
49
417
0
14 Nov 2020
Cross-Modality Protein Embedding for Compound-Protein Affinity and Contact Prediction
Yuning You
Yang Shen
25
8
0
14 Nov 2020
Transductive Zero-Shot Learning using Cross-Modal CycleGAN
Patrick Bordes
Éloi Zablocki
Benjamin Piwowarski
Patrick Gallinari
VLM
27
0
0
13 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
21
81
0
10 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
21
94
0
10 Nov 2020
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
Ece Takmaz
Mario Giulianelli
Sandro Pezzelle
Arabella J. Sinclair
Raquel Fernández
20
26
0
09 Nov 2020
CapWAP: Captioning with a Purpose
Adam Fisch
Kenton Lee
Ming-Wei Chang
J. Clark
Regina Barzilay
8
11
0
09 Nov 2020
Multi-modal, multi-task, multi-attention (M3) deep learning detection of reticular pseudodrusen: towards automated and accessible classification of age-related macular degeneration
Qingyu Chen
T. Keenan
Alexis Allot
Yifan Peng
Elvira Agrón
...
Chantal Cousineau-Krieger
W. Wong
Yingying Zhu
E. Chew
Zhiyong Lu
MedIm
33
20
0
09 Nov 2020
Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay
Mostafa Dehghani
Samira Abnar
Songlin Yang
Dara Bahri
Philip Pham
J. Rao
Liu Yang
Sebastian Ruder
Donald Metzler
53
696
0
08 Nov 2020
Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles
Christopher Clark
Mark Yatskar
Luke Zettlemoyer
26
61
0
07 Nov 2020
Utilizing Every Image Object for Semi-supervised Phrase Grounding
Haidong Zhu
Arka Sadhu
Zhao-Heng Zheng
Ram Nevatia
ObjD
25
7
0
05 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
Yue Wang
Jing Li
M. Lyu
Irwin King
19
16
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
31
168
0
01 Nov 2020
Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View
Yangyang Guo
Liqiang Nie
Zhiyong Cheng
Q. Tian
Min Zhang
19
69
0
30 Oct 2020
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Stanislav Frolov
Shailza Jolly
Jörn Hees
Andreas Dengel
EGVM
20
5
0
28 Oct 2020
Co-attentional Transformers for Story-Based Video Understanding
Björn Bebensee
Byoung-Tak Zhang
22
5
0
27 Oct 2020
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Aisha Urooj Khan
Amir Mazaheri
N. Lobo
M. Shah
32
56
0
27 Oct 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li
Haoxuan You
Zhecan Wang
Alireza Zareian
Shih-Fu Chang
Kai-Wei Chang
SSL
VLM
72
12
0
24 Oct 2020
Can images help recognize entities? A study of the role of images for Multimodal NER
Shuguang Chen
Gustavo Aguilar
Leonardo Neves
Thamar Solorio
EgoV
45
33
0
23 Oct 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
Itai Gat
Idan Schwartz
Alex Schwing
Tamir Hazan
60
90
0
21 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
25
6
0
19 Oct 2020
Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering
Hantao Huang
Tao Han
Wei Han
D. Yap
Cheng-Ming Chiang
21
2
0
17 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
24
21
0
16 Oct 2020
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović
Chandra Bhagavatula
J. S. Park
Ronan Le Bras
Noah A. Smith
Yejin Choi
ReLM
LRM
18
62
0
15 Oct 2020
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan
Mohit Bansal
CLIP
22
120
0
14 Oct 2020
Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
Jack Hessel
Lillian Lee
29
72
0
13 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Fuli Luo
Pengcheng Yang
Shicheng Li
Xuancheng Ren
Xu Sun
VLM
SSL
21
16
0
13 Oct 2020
Contrast and Classify: Training Robust VQA Models
Yash Kant
A. Moudgil
Dhruv Batra
Devi Parikh
Harsh Agrawal
21
5
0
13 Oct 2020
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding
Qinxin Wang
Hao Tan
Sheng Shen
Michael W. Mahoney
Z. Yao
ObjD
50
11
0
12 Oct 2020
Beyond Language: Learning Commonsense from Images for Reasoning
Wanqing Cui
Yanyan Lan
Liang Pang
Jiafeng Guo
Xueqi Cheng
LRM
29
5
0
10 Oct 2020
Interpretable Neural Computation for Real-World Compositional Visual Question Answering
Ruixue Tang
Chao Ma
CoGe
19
2
0
10 Oct 2020
ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization
Tzuf Paz-Argaman
Yuval Atzmon
Gal Chechik
Reut Tsarfaty
VLM
32
32
0
07 Oct 2020
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
22
244
0
06 Oct 2020
Pathological Visual Question Answering
Xuehai He
Zhuo Cai
Wenlan Wei
Yichen Zhang
Luntian Mou
Eric Xing
P. Xie
75
24
0
06 Oct 2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering
M. Farazi
Salman Khan
Nick Barnes
19
2
0
05 Oct 2020
Multi-Modal Open-Domain Dialogue
Kurt Shuster
Eric Michael Smith
Da Ju
Jason Weston
AI4CE
41
42
0
02 Oct 2020
Which *BERT? A Survey Organizing Contextualized Encoders
Patrick Xia
Shijie Wu
Benjamin Van Durme
26
50
0
02 Oct 2020
Contrastive Learning of Medical Visual Representations from Paired Images and Text
Yuhao Zhang
Hang Jiang
Yasuhide Miura
Christopher D. Manning
C. Langlotz
MedIm
61
733
0
02 Oct 2020
ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
José Manuél Gómez-Pérez
Raúl Ortega
43
24
0
01 Oct 2020
Learning Object Detection from Captions via Textual Scene Attributes
Achiya Jerbi
Roei Herzig
Jonathan Berant
Gal Chechik
Amir Globerson
27
21
0
30 Sep 2020
Attention that does not Explain Away
Nan Ding
Xinjie Fan
Zhenzhong Lan
Dale Schuurmans
Radu Soricut
27
3
0
29 Sep 2020
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
Xiaowei Hu
Xi Yin
Kevin Qinghong Lin
Lijuan Wang
Lefei Zhang
Jianfeng Gao
Zicheng Liu
VLM
22
56
0
28 Sep 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
30
102
0
23 Sep 2020
Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering
Tuong Khanh Long Do
Binh X. Nguyen
Huy Tran
Erman Tjiputra
Quang-Dieu Tran
Thanh-Toan Do
23
2
0
23 Sep 2020
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
OOD
22
140
0
18 Sep 2020