Title
WeaQA: Weak Supervision via Captions for Visual Question Answering Pratyay Banerjee Tejas Gokhale Yezhou Yang Chitta Baral 110 36 0 04 Dec 2020
Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D Ankit Goyal Kaiyu Yang Dawei Yang Jia Deng 91 42 0 03 Dec 2020
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs Emanuele Bugliarello Ryan Cotterell Naoaki Okazaki Desmond Elliott 102 120 0 30 Nov 2020
Language-Driven Region Pointer Advancement for Controllable Image Captioning Annika Lindh R. Ross John D. Kelleher 43 14 0 30 Nov 2020
Self-Supervised Real-to-Sim Scene Generation Aayush Prakash Shoubhik Debnath Jean-Francois Lafleche Eric Cameracci Gavriel State Stan Birchfield M. Law 82 26 0 30 Nov 2020
General Multi-label Image Classification with Transformers Jack Lanchantin Tianlu Wang Vicente Ordonez Yanjun Qi ViT 80 268 0 27 Nov 2020
Road Scene Graph: A Semantic Graph-Based Scene Representation Dataset for Intelligent Vehicles Yafu Tian Alexander Carballo Ruifeng Li K. Takeda GNN 89 27 0 27 Nov 2020
Learning from Lexical Perturbations for Consistent Visual Question Answering Spencer Whitehead Hui Wu Yi R. Fung Heng Ji Rogerio Feris Kate Saenko 68 11 0 26 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation Yicong Hong Qi Wu Yuankai Qi Cristian Rodriguez-Opazo Stephen Gould LM&Ro 128 303 0 26 Nov 2020
Open-Vocabulary Object Detection Using Captions Alireza Zareian Kevin Dela Rosa Derek Hao Hu Shih-Fu Chang VLM ObjD 187 436 0 20 Nov 2020
Classification by Attention: Scene Graph Classification with Prior Knowledge Sahand Sharifzadeh Sina Moayed Baharlou Volker Tresp OCL 76 52 0 19 Nov 2020
Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision Yujie Zhong Linhai Xie Sen Wang Lucia Specia Yishu Miao SSL 26 0 0 19 Nov 2020
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions Jianan Wang Boyang Albert Li Xiangyu Fan Jing-Hua Lin Yanwei Fu 46 2 0 15 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations Linchao Zhu Yi Yang ViT 134 423 0 14 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers Zongheng Tang Yue Liao Si Liu Guanbin Li Xiaojie Jin Hongxu Jiang Qian Yu Dong Xu 68 99 0 10 Nov 2020
After All, Only The Last Neuron Matters: Comparing Multi-modal Fusion Functions for Scene Graph Generation Mohamed Karim Belaid 89 1 0 09 Nov 2020
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts Ece Takmaz Mario Giulianelli Sandro Pezzelle Arabella J. Sinclair Raquel Fernández 98 26 0 09 Nov 2020
CapWAP: Captioning with a Purpose Adam Fisch Kenton Lee Ming-Wei Chang J. Clark Regina Barzilay 53 11 0 09 Nov 2020
Dual ResGCN for Balanced Scene Graph Generation Jingyi Zhang Yong Zhang Baoyuan Wu Yanbo Fan Fumin Shen Heng Tao Shen 67 12 0 09 Nov 2020
An Improved Attention for Visual Question Answering Tanzila Rahman Shih-Han Chou Leonid Sigal Giuseppe Carenini 44 45 0 04 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings Yue Wang Jing Li Michael R. Lyu Irwin King 75 16 0 03 Nov 2020
Diverse Image Captioning with Context-Object Split Latent Spaces Shweta Mahajan Stefan Roth 64 42 0 02 Nov 2020
RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering Zanxia Jin Heran Wu Chun Yang Fang Zhou Jingyan Qin Lei Xiao Xu-Cheng Yin 88 31 0 24 Oct 2020
Show and Speak: Directly Synthesize Spoken Description of Images Xinsheng Wang Siyuan Feng Jihua Zhu M. Hasegawa-Johnson O. Scharenborg 152 4 0 23 Oct 2020
Learning Dual Semantic Relations with Graph Attention for Image-Text Matching Keyu Wen Xiaodong Gu Qingrong Cheng 76 97 0 22 Oct 2020
Contextual Heterogeneous Graph Network for Human-Object Interaction Detection Hai Wang Weishi Zheng Yingbiao Ling 88 88 0 20 Oct 2020
Language and Visual Entity Relationship Graph for Agent Navigation Yicong Hong Cristian Rodriguez-Opazo Yuankai Qi Qi Wu Stephen Gould LM&Ro 226 134 0 19 Oct 2020
Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering Hantao Huang Tao Han Wei Han D. Yap Cheng-Ming Chiang 28 4 0 17 Oct 2020
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision Hao Tan Joey Tianyi Zhou CLIP 89 121 0 14 Oct 2020
Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! Jack Hessel Lillian Lee 108 75 0 13 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations Fuli Luo Pengcheng Yang Shicheng Li Xuancheng Ren Xu Sun VLM SSL 73 16 0 13 Oct 2020
DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video Cristian Rodriguez-Opazo Edison Marrese-Taylor Basura Fernando Hongdong Li Stephen Gould 192 10 0 13 Oct 2020
Webly Supervised Image Classification with Metadata: Automatic Noisy Label Correction via Visual-Semantic Graph Jingkang Yang Weirong Chen Xue Jiang Xiaopeng Yan Huabin Zheng Wayne Zhang NoLa 74 13 0 12 Oct 2020
Beyond Language: Learning Commonsense from Images for Reasoning Wanqing Cui Yanyan Lan Liang Pang Jiafeng Guo Xueqi Cheng LRM 71 5 0 10 Oct 2020
Background Learnable Cascade for Zero-Shot Object Detection Ye Zheng Ruoran Huang Chuanqi Han Xi Huang Li Cui ObjD 123 48 0 09 Oct 2020
Dense Relational Image Captioning via Multi-task Triple-Stream Networks Dong-Jin Kim Tae-Hyun Oh Jinsoo Choi In So Kweon 115 27 0 08 Oct 2020
Pathological Visual Question Answering Xuehai He Zhuo Cai Wenlan Wei Yichen Zhang Luntian Mou Eric Xing P. Xie 140 24 0 06 Oct 2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering M. Farazi Salman Khan Nick Barnes 40 2 0 05 Oct 2020
Multi-Modal Open-Domain Dialogue Kurt Shuster Eric Michael Smith Da Ju Jason Weston AI4CE 137 44 0 02 Oct 2020
CAPTION: Correction by Analyses, POS-Tagging and Interpretation of Objects using only Nouns L. Ferreira Douglas De Rizzo Meneghetti P. Santos 21 2 0 02 Oct 2020
Learning Object Detection from Captions via Textual Scene Attributes Achiya Jerbi Roei Herzig Jonathan Berant Gal Chechik Amir Globerson 79 21 0 30 Sep 2020
Attention that does not Explain Away Nan Ding Xinjie Fan Zhenzhong Lan Dale Schuurmans Radu Soricut 54 3 0 29 Sep 2020
Spatial Attention as an Interface for Image Captioning Models P. Sadler 51 0 0 29 Sep 2020
Addressing Class Imbalance in Scene Graph Parsing by Learning to Contrast and Score He Huang Shunta Saito Yuta Kikuchi Eiichi Matsumoto Wei Tang Philip S. Yu 36 5 0 28 Sep 2020
Human-Object Interaction Detection: A Quick Survey and Examination of Methods T. Bergstrom Humphrey Shi ObjD 36 12 0 27 Sep 2020
SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors Mohammad Keshavarzi Aakash Parikh Xiyu Zhai Melody Mao Luisa Caldas An Yang 79 24 0 25 Sep 2020
Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases Gerhard Weikum Luna Dong Simon Razniewski Fabian M. Suchanek 144 128 0 24 Sep 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers Jaemin Cho Jiasen Lu Dustin Schwenk Hannaneh Hajishirzi Aniruddha Kembhavi VLM MLLM 95 102 0 23 Sep 2020
Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering Tuong Khanh Long Do Binh X. Nguyen Huy Tran Erman Tjiputra Quang-Dieu Tran Thanh-Toan Do 40 2 0 23 Sep 2020
ALICE: Active Learning with Contrastive Natural Language Explanations Weixin Liang James Zou Zhou Yu VLM 105 51 0 22 Sep 2020